Certain VMs in us-northcentral1-a are unavailable

Incident Report for Crusoe Cloud

Postmortem

At 20:31 UTC on Nov 15, 2023, a row in one of our data centers in our us-northcentral1 region experienced a power failure that affected a subset of VMs in that region. Our teams identified the affected VMs and restarted them once power was restored. The final VM was restarted at 00:20 UTC on Nov 16, 2023.

The power failure occurred during routine maintenance of the UPS in that row. To perform the maintenance, the UPS was put in "bypass" mode, which allows it to be worked on without interrupting power delivery to the row. During that maintenance period, for unrelated reasons, power switched from our primary source to a backup source. Without the UPS to regulate this power switch, the row lost power and all of its servers powered off, shutting down the VMs running on them.

We are reviewing our standard operating procedures for UPS maintenance to minimize the probability of customer impact during future maintenance events. Our future data centers will also include redundant UPSs so that this situation does not recur. Additionally, we will continue to improve our recovery procedures to further reduce Mean Time To Resolution (MTTR) in the event of any future issues.

We recognize that any service interruption causes significant disruption to your operations, and we apologize for this inconvenience.

If you believe that this downtime resulted in an SLA violation, please contact support.

Posted Nov 16, 2023 - 20:34 UTC

Resolved

We have returned service to all affected customer VMs. We appreciate your patience as we worked to get this incident resolved. Please reach out to support if you have any questions or concerns.

Posted Nov 15, 2023 - 23:38 UTC

Update

We returned service to most of the affected VMs, and all remaining VMs are expected to return within the hour. We will provide another update as soon as all service is restored.

Posted Nov 15, 2023 - 23:05 UTC

Update

We have returned service to a subset of affected VMs. We will continue to restart VMs and provide another update by 23:00 UTC (15:00 PT).

Posted Nov 15, 2023 - 22:32 UTC

Update

We will begin restarting VMs shortly. We will provide another update by 22:30 UTC (14:30 PT).

Posted Nov 15, 2023 - 22:00 UTC

Update

We have identified the affected VMs and are working to restore service. We will provide another update by 22:00 UTC (14:00 PT).

Posted Nov 15, 2023 - 21:31 UTC

Identified

Certain VMs in us-northcentral1-a are unavailable; we are working on restoring service to these VMs and will update again by 21:30 UTC (13:30 PT).

Posted Nov 15, 2023 - 21:07 UTC

This incident affected: GPU Virtual Machines (us-northcentral1).