June 15, 2021 Outage Postmortem

On June 15, we had roughly 64 minutes of downtime (between 01:05 UTC and 02:09 UTC) on our US servers, which took down a majority of our services.

This was caused by a power outage on one of the legs (“A-side”) feeding the rack that hosts these services. Other people at the same location have also reported the same issue.

The power provided to the rack comes in two legs (“A-side” and “B-side”), and most of the equipment on the rack is connected to both legs for redundancy. The router is one of the few pieces of equipment on the rack without a redundant PSU (and definitely the most critical one), and at the time of the power cut it was connected to the leg that went down.

Indeed, when the power cut happened, most of the servers actually stayed up, but without a working network connection. This was also visible afterwards in the energy use graphs for our B-side PDU, which picked up the load normally shared with the A-side:

B-side PDU graphs showing a significant spike from ~9A to ~16.5A for roughly an hour
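
As a rough illustration of what the graph implies, the sketch below works out how much current moved onto the B-side during the outage, using the approximate readings quoted above (~9A before, ~16.5A during). The function names, the breaker ratings, and the 80% continuous-load margin are assumptions made for the example and are not details of our actual rack.

```python
def shifted_load_amps(before_amps: float, during_amps: float) -> float:
    """Current that moved onto the surviving leg when the other leg failed."""
    return during_amps - before_amps


def fits_on_single_leg(total_amps: float, breaker_rating_amps: float,
                       derating: float = 0.8) -> bool:
    """Check whether the full load fits on one leg.

    `derating` is an assumed safety margin (80% of the breaker rating is a
    common rule of thumb for continuous loads), not a figure from this rack.
    """
    return total_amps <= breaker_rating_amps * derating


# Approximate readings taken from the B-side PDU graph.
B_SIDE_BEFORE = 9.0
B_SIDE_DURING = 16.5

shifted = shifted_load_amps(B_SIDE_BEFORE, B_SIDE_DURING)
print(f"Load shifted to B-side: ~{shifted} A")

# Hypothetical breaker ratings, purely for illustration.
for rating in (20.0, 30.0):
    ok = fits_on_single_leg(B_SIDE_DURING, rating)
    print(f"{rating} A breaker: {'fits' if ok else 'exceeds margin'}")
```

A check like this is mainly useful for confirming ahead of time that either leg can carry the whole rack on its own, which is what makes dual-fed equipment genuinely redundant.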

As this issue affected our services solely because the router lacks a redundant PSU, we are currently researching options for replacing it. We will provide more updates once it is replaced.