How We Monitor Our Infrastructure 24/7
What's actually running behind the scenes to catch problems before customers notice them.
The 99.9% uptime number we advertise isn't a guess; it's backed by actual monitoring data. Here's a look at the stack behind that number and how we use it.
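For a sense of scale, it's worth working out the downtime budget that target implies. The quick calculation below follows directly from the percentage itself; nothing in it is specific to our stack:

```python
# Downtime budget implied by a 99.9% uptime target (back-of-the-envelope).
TARGET = 0.999
MINUTES_PER_DAY = 24 * 60

for period_name, days in [("day", 1), ("month", 30), ("year", 365)]:
    budget_min = days * MINUTES_PER_DAY * (1 - TARGET)
    print(f"99.9% over one {period_name}: at most {budget_min:.1f} minutes of downtime")
```

Roughly 43 minutes per month is all the slack 99.9% allows, which is why fast detection matters as much as fast repair.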
Every game server node gets pinged every 30 seconds from multiple external locations. If a node fails to respond from two or more locations simultaneously, an alert fires immediately. We test not just ICMP reachability but also the actual Pterodactyl Wings API endpoint, because a node can be reachable but have its daemon in a broken state. Both need to be healthy.
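As a rough sketch of how that dual check fits together, the snippet below probes a node with a single ICMP echo and then hits the Wings daemon over HTTP, alerting only when two or more locations see a failure at once. The endpoint path, port, and helper names are illustrative assumptions, not our actual tooling:

```python
"""Sketch of the per-node health check. The Wings URL below is an assumed
placeholder; the real check runs from multiple external vantage points."""
import subprocess
import urllib.request

WINGS_API = "https://{node}:8080/api/system"  # hypothetical daemon health endpoint


def node_reachable(host: str, timeout_s: int = 2) -> bool:
    # One ICMP echo via the system ping binary (Linux flags shown;
    # avoids needing raw-socket privileges in Python).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0


def wings_healthy(host: str, timeout_s: int = 2) -> bool:
    # The daemon can be broken even when the host answers pings,
    # so we also hit the Wings API and require an HTTP 200.
    try:
        with urllib.request.urlopen(WINGS_API.format(node=host), timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False


def check_from_location(host: str) -> bool:
    # A node counts as healthy only if both probes pass.
    return node_reachable(host) and wings_healthy(host)


def should_alert(results_by_location: dict[str, bool]) -> bool:
    # Fire only when two or more locations see a failure simultaneously,
    # which filters out transient problems at a single vantage point.
    failures = sum(1 for ok in results_by_location.values() if not ok)
    return failures >= 2
```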
We collect CPU load, memory usage, disk I/O, and network throughput from every node continuously. Alerting thresholds are set conservatively: we'd rather get a noisy alert that turns out to be nothing than miss something real. Sustained CPU above 85% for more than five minutes fires a warning; disk usage above 80% on any node opens a ticket automatically.
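A minimal sketch of those two threshold rules, assuming metrics arrive as 30-second samples (the class and function names here are hypothetical, not our real alerting API):

```python
"""Sketch of the CPU and disk threshold checks, assuming 30-second samples."""
from collections import deque

CPU_THRESHOLD = 85.0         # percent
CPU_SUSTAINED_SECONDS = 300  # "more than five minutes"
DISK_THRESHOLD = 80.0        # percent
SAMPLE_INTERVAL = 30         # seconds between samples


class CpuWatcher:
    def __init__(self) -> None:
        window = CPU_SUSTAINED_SECONDS // SAMPLE_INTERVAL
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent: float) -> bool:
        """Return True when every sample in the five-minute window is over threshold."""
        self.samples.append(cpu_percent)
        return (len(self.samples) == self.samples.maxlen
                and all(s > CPU_THRESHOLD for s in self.samples))


def disk_needs_ticket(disk_used_percent: float) -> bool:
    # Disk fills are slow-moving, so a single sample over 80% is enough to act on.
    return disk_used_percent > DISK_THRESHOLD
```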
Traffic baselines are maintained per-node and per-IP range. Significant deviations from baseline — either sudden spikes (potential DDoS) or sudden drops (potential upstream issue) — trigger immediate investigation. The DDoS detection side of this is what feeds into our automated mitigation pipeline.
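One simple way to express that kind of baseline check, shown below purely as an illustration: keep a rolling window of throughput samples and flag readings that land several standard deviations from the window's mean. The window size and deviation multiplier are made-up tuning knobs, not our production values.

```python
"""Sketch of per-node traffic anomaly detection against a rolling baseline."""
from collections import deque
from statistics import mean, stdev


class TrafficBaseline:
    def __init__(self, window: int = 2880, deviation_factor: float = 4.0):
        # 2880 samples at 30-second intervals is roughly a 24-hour baseline.
        self.history = deque(maxlen=window)
        self.deviation_factor = deviation_factor

    def observe(self, mbps: float) -> str | None:
        """Return "spike" or "drop" when traffic deviates sharply from baseline."""
        result = None
        if len(self.history) >= 60:  # require some history before judging
            baseline = mean(self.history)
            spread = stdev(self.history) or 1.0  # avoid zero spread on flat traffic
            if mbps > baseline + self.deviation_factor * spread:
                result = "spike"  # possible DDoS: feeds the mitigation pipeline
            elif mbps < baseline - self.deviation_factor * spread:
                result = "drop"   # possible upstream issue
        self.history.append(mbps)
        return result
```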
All of this is meaningless without someone to respond to the alerts. We maintain an on-call rotation so there's always an engineer available to handle escalations regardless of time zone or time of day. Most alerts get resolved without any customer-facing impact, which is the goal.