How We Monitor Our Infrastructure 24/7
What's actually running behind the scenes to catch problems before customers notice them.
The 99.9% uptime number we advertise isn't a guess; it's backed by actual monitoring data. Here's a look at the stack behind that number and how we use it.
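For a sense of scale, it's worth working out the downtime budget that target implies. The quick calculation below follows directly from the percentage itself; nothing in it is specific to our stack:

```python
# Downtime budget implied by a 99.9% uptime target (back-of-the-envelope).
TARGET = 0.999
MINUTES_PER_DAY = 24 * 60

for period_name, days in [("day", 1), ("month", 30), ("year", 365)]:
    budget_min = days * MINUTES_PER_DAY * (1 - TARGET)
    print(f"99.9% over one {period_name}: at most {budget_min:.1f} minutes of downtime")
```

Roughly 43 minutes per month is all the slack 99.9% allows, which is why fast detection matters as much as fast repair.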
Every game server node gets pinged every 30 seconds from multiple external locations. If a node fails to respond from two or more locations simultaneously, an alert fires immediately. We test not just ICMP reachability but also the actual Pterodactyl Wings API endpoint, because a node can be reachable but have its daemon in a broken state. Both need to be healthy.
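As a rough sketch of how that dual check fits together, the snippet below probes a node with a single ICMP echo and then hits the Wings daemon over HTTP, alerting only when two or more locations see a failure at once. The endpoint path, port, and helper names are illustrative assumptions, not our actual tooling:

```python
"""Sketch of the per-node health check. The Wings URL below is an assumed
placeholder; the real check runs from multiple external vantage points."""
import subprocess
import urllib.request

WINGS_API = "https://{node}:8080/api/system"  # hypothetical daemon health endpoint


def node_reachable(host: str, timeout_s: int = 2) -> bool:
    # One ICMP echo via the system ping binary (Linux flags shown;
    # avoids needing raw-socket privileges in Python).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0


def wings_healthy(host: str, timeout_s: int = 2) -> bool:
    # The daemon can be broken even when the host answers pings,
    # so we also hit the Wings API and require an HTTP 200.
    try:
        with urllib.request.urlopen(WINGS_API.format(node=host), timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False


def check_from_location(host: str) -> bool:
    # A node counts as healthy only if both probes pass.
    return node_reachable(host) and wings_healthy(host)


def should_alert(results_by_location: dict[str, bool]) -> bool:
    # Fire only when two or more locations see a failure simultaneously,
    # which filters out transient problems at a single vantage point.
    failures = sum(1 for ok in results_by_location.values() if not ok)
    return failures >= 2
```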
We collect CPU load, memory usage, disk I/O, and network throughput from every node continuously. Alerting thresholds are set conservatively: we'd rather get a noisy alert that turns out to be nothing than miss something real. Sustained CPU above 85% for more than five minutes fires a warning; disk usage above 80% on any node opens a ticket automatically.
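A minimal sketch of those two threshold rules, assuming metrics arrive as 30-second samples (the class and function names here are hypothetical, not our real alerting API):

```python
"""Sketch of the CPU and disk threshold checks, assuming 30-second samples."""
from collections import deque

CPU_THRESHOLD = 85.0         # percent
CPU_SUSTAINED_SECONDS = 300  # "more than five minutes"
DISK_THRESHOLD = 80.0        # percent
SAMPLE_INTERVAL = 30         # seconds between samples


class CpuWatcher:
    def __init__(self) -> None:
        window = CPU_SUSTAINED_SECONDS // SAMPLE_INTERVAL
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent: float) -> bool:
        """Return True when every sample in the five-minute window is over threshold."""
        self.samples.append(cpu_percent)
        return (len(self.samples) == self.samples.maxlen
                and all(s > CPU_THRESHOLD for s in self.samples))


def disk_needs_ticket(disk_used_percent: float) -> bool:
    # Disk fills are slow-moving, so a single sample over 80% is enough to act on.
    return disk_used_percent > DISK_THRESHOLD
```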
Traffic baselines are maintained per-node and per-IP range. Significant deviations from baseline — either sudden spikes (potential DDoS) or sudden drops (potential upstream issue) — trigger immediate investigation. The DDoS detection side of this is what feeds into our automated mitigation pipeline.
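One simple way to express that kind of baseline check, shown below purely as an illustration: keep a rolling window of throughput samples and flag readings that land several standard deviations from the window's mean. The window size and deviation multiplier are made-up tuning knobs, not our production values.

```python
"""Sketch of per-node traffic anomaly detection against a rolling baseline."""
from collections import deque
from statistics import mean, stdev


class TrafficBaseline:
    def __init__(self, window: int = 2880, deviation_factor: float = 4.0):
        # 2880 samples at 30-second intervals is roughly a 24-hour baseline.
        self.history = deque(maxlen=window)
        self.deviation_factor = deviation_factor

    def observe(self, mbps: float) -> str | None:
        """Return "spike" or "drop" when traffic deviates sharply from baseline."""
        result = None
        if len(self.history) >= 60:  # require some history before judging
            baseline = mean(self.history)
            spread = stdev(self.history) or 1.0  # avoid zero spread on flat traffic
            if mbps > baseline + self.deviation_factor * spread:
                result = "spike"  # possible DDoS: feeds the mitigation pipeline
            elif mbps < baseline - self.deviation_factor * spread:
                result = "drop"   # possible upstream issue
        self.history.append(mbps)
        return result
```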
All of this is meaningless without someone to respond to the alerts. We maintain an on-call rotation so there's always an engineer available to handle escalations regardless of time zone or time of day. Most alerts get resolved without any customer-facing impact, which is the goal.