Why We Migrated Everything to NVMe
Autosave lag spikes were our number one support complaint. Switching storage fixed almost all of them.
If you ran a Minecraft server on our old infrastructure and noticed periodic lag spikes every few minutes, you probably know what autosave lag feels like. The server freezes for half a second, everyone's ping spikes, chunks stop loading. It's one of the most common complaints in Minecraft hosting and it's almost always a storage problem.
Minecraft's world save process is synchronous by default in vanilla — when the server saves, everything else waits. Even with async save implementations like Paper's, you still need the underlying storage to be fast enough to keep up with the I/O burst. On spinning disk, that's often the bottleneck. On SATA SSD it's better but still not ideal under load. On NVMe it largely stops being a problem.
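If you want a rough feel for how a given disk handles this kind of burst, a small probe along these lines can help. It's a minimal sketch, not our actual benchmark: it writes fixed-size blocks to a temporary file and forces each one to disk with fsync, then reports the median, 99th percentile, and worst-case latency. The function names, the 64 KiB block size, and the write count are all assumptions chosen for illustration, and a sequential write-plus-fsync loop is only a loose stand-in for a real world save.

```python
import os
import tempfile
import time


def measure_fsync_latency(directory, writes=50, size=64 * 1024):
    """Return per-write latencies in milliseconds for write + fsync.

    `directory` should live on the disk you want to test. The block size
    and write count are illustrative defaults, not tuned values.
    """
    payload = os.urandom(size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to the device, not just the page cache
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to median, p99 and max (all in ms)."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": ordered[len(ordered) // 2],
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    print(summarize(measure_fsync_latency(tempfile.gettempdir())))
```

The number to watch is the tail rather than the median. A disk can look fine on average and still produce the occasional long stall that players notice as a freeze.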
The older nodes had a mix of storage — some SATA SSDs for active game servers, some HDDs that were supposed to be retired but kept getting extended. Not our finest hour, honestly. The HDDs in particular were causing measurable issues on servers with large world files and active player counts. Random I/O on spinning rust under concurrent load is painful.
We couldn't just swap drives on live nodes, so we provisioned the new NVMe hardware in parallel and migrated servers across in batches. Each migration involved a brief planned downtime of about 90 seconds while the server was stopped, data was synced across, and the server was started on the new node. We scheduled these during off-peak hours and notified affected customers in advance.
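The per-server sequence described above (stop, sync, start) can be sketched roughly as follows. This is an illustration of the ordering only, not our real tooling: `migrate_batch` and the `stop`, `sync`, and `start` callables are hypothetical names, and in practice each would wrap whatever your panel or orchestration layer provides. Failure handling and rollback are deliberately left out.

```python
import time


def migrate_batch(server_ids, stop, sync, start):
    """Migrate servers one at a time and return downtime in seconds per server.

    `stop`, `sync` and `start` are hypothetical callables that each take a
    server id. The server is offline from the stop call until start returns.
    """
    downtime = {}
    for server_id in server_ids:
        began = time.monotonic()
        stop(server_id)    # stop the server on the old node
        sync(server_id)    # copy its data to the new node while it is stopped
        start(server_id)   # bring it up on the new node
        downtime[server_id] = time.monotonic() - began
    return downtime


if __name__ == "__main__":
    # Dry run with print statements standing in for the real operations.
    result = migrate_batch(
        ["server-a", "server-b"],
        stop=lambda s: print("stopping", s),
        sync=lambda s: print("syncing", s),
        start=lambda s: print("starting", s),
    )
    print(result)
```

Handling one server at a time means each customer's downtime covers only their own stop, sync, and start, not the whole batch.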
Support tickets about lag spikes dropped noticeably after the migration completed. Not to zero — there are other causes of Minecraft lag beyond storage — but the storage-related ones essentially disappeared. That's the outcome we were after.
All new servers provision on NVMe nodes by default now. The old SATA and HDD hardware has been retired or repurposed for non-latency-sensitive workloads.