At 20:31 UTC on Nov 15, 2023, a row in one of our data centers in our us-northcentral1 region experienced a power failure which affected a percentage of VMs in that region. Our teams identified the affected VMs, and worked to restart them as soon as power was restored. The final VM was restarted at 00:20 UTC on Nov 16, 2023.
The power failure occurred during routine maintenance of the UPS in that row. To perform the maintenance, the UPS was put in "bypass" mode, which allows it to be worked on without interrupting power delivery to the row. During that maintenance period, due to unrelated reasons, power switched from our primary source to a backup source. Without the UPS to regulate this power switch, the row lost power and all servers powered off, shutting down the VMs running inside them.
We are reviewing our standard operating procedures related to UPS maintenance to ensure the lowest probability of customer impact during future maintenance events. Our future data centers will also include redundant UPSs so we will not be in this situation again. Additionally, we will continue to improve our recovery procedures to further reduce Mean Time To Resolution (MTTR) in the event of any future issues.
We recognize that any service interruption causes significant disruption to your operations, and we apologize for this inconvenience.