If you fail to plan, you are planning to fail: lessons from the Texas power grid failure!
The technology section of The Atlantic published an article on ‘what went wrong?’, explaining the three fundamental errors that led to an infrastructure failure that left millions of people in Texas shivering through some of the coldest days in decades. The lessons from this failure apply closely to all things technology. Whether you are developing a public-facing web application or are responsible for a backend that connects several microservices through complex multi-tier APIs, failure is inevitable. Here is my analysis of the key learnings from Texas’s power grid infrastructure failure that can help with automated disaster recovery for your technology apps:
1. Scalability issue (not enough stress and load testing): The Texas power grid hit a scalability limit. It was not designed to handle total power generation beyond a certain load, because temperatures colder than 50 F were not considered.
As requirements change and demand grows, the system must adapt and keep working. Scalability is an important property in distributed and parallel computing: it describes a system’s ability to adjust its computing performance dynamically by changing the available computing resources and scheduling methods. Scalability has two aspects, hardware and software. Hardware scalability means handling changing workloads by changing hardware resources, such as the number of processors, the amount of memory, or disk capacity. Software scalability means meeting changing workloads by changing the scheduling method and the degree of parallelism.
Scalability testing ensures that the system, process, or network keeps functioning well as it grows, whatever the change in size or volume. Many highly anticipated launches fail on day one because of scalability issues (see the Disney+ case: https://www.cnbc.com/2019/11/20/disney-exec-explains-why-disney-plus-crashed-on-its-first-day.html). Scalability testing involves load and stress testing (a minimal sketch follows the list below):
- It tells you how your application behaves under increasing load
- Determines the limits of the web application in terms of concurrent users
- Determines the end-user experience under load
- Determines the robustness of the server
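To make this concrete, here is a minimal load-test sketch using Locust. The host, endpoints, and request mix are hypothetical placeholders rather than anything from the article; treat it as a starting point, not a recipe.

```python
# Minimal Locust load test (hypothetical endpoints and host).
# Run with: locust -f loadtest.py --host https://your-app.example.com
from locust import HttpUser, task, between


class AppUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def view_status(self):
        # Hot read path: most simulated traffic hits this endpoint.
        self.client.get("/status")

    @task(1)
    def submit_job(self):
        # Heavier write path, exercised less often.
        self.client.post("/jobs", json={"payload": "stress-test"})
```

Ramping the simulated user count up until response times or error rates degrade tells you exactly where the scalability limit sits, which is the number the grid operators did not have.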
2. No failover plan for an alternative source of energy: The Texas power grid relies on wind, solar, and natural gas to generate electricity. When wind and solar generation (about 10% of total generation) failed, fossil fuel (natural gas) did not have enough runway to continue supplying power and keep homes warm.
Failover testing is something I learnt from my manager at D-Wave. He formally introduced me to the concept because my team is responsible and accountable for 23+ production releases every year, which include LEAP (a quantum cloud platform as a service), hybrid solvers (used for solving quantum problems with large numbers of variables), and several other distributed components.
Failover testing is a technique for checking a system’s ability to provision extra resources and move to backup systems when the primary fails for one reason or another. It is also known as ‘reliability testing’ or ‘redundancy mechanism testing’. For example, some of the key questions we try to answer during failover testing at D-Wave:
What if one of the hardware components fails, for example the CPU/GPU, or a quantum solver stops responding? What if you run out of memory for problem storage in Redis? What happens to users’ jobs if the network pipe is choked? Is there a graceful restart of services? Do we lose any data while failing over to secondary backups/components/services?
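Questions like these can be turned into automated checks. The sketch below is purely illustrative: the JobStore class, the in-memory backends, and the test are hypothetical stand-ins for a primary store (such as Redis) and its secondary backup, not D-Wave’s actual implementation.

```python
# Hypothetical failover check: jobs must survive a primary-store outage.
class BackendUnavailable(Exception):
    """Raised when a storage backend cannot accept writes."""


class InMemoryBackend:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.data = {}

    def put(self, key, value):
        if not self.healthy:
            raise BackendUnavailable("backend is out of memory or unreachable")
        self.data[key] = value


class JobStore:
    """Writes to the primary backend and fails over to the secondary on error."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def save(self, job_id, payload):
        try:
            self.primary.put(job_id, payload)
        except BackendUnavailable:
            # Failover path: no user data should be lost when the primary dies.
            self.secondary.put(job_id, payload)


def test_jobs_survive_primary_failure():
    primary = InMemoryBackend(healthy=False)   # simulate the primary running out of memory
    secondary = InMemoryBackend(healthy=True)
    store = JobStore(primary, secondary)

    store.save("job-42", {"qubits": 128})

    # The user's job must land somewhere, even with the primary down.
    assert secondary.data["job-42"] == {"qubits": 128}
```

Running a test like this (for example with pytest) on every release turns “do we lose data during failover?” from a hope into a checked answer.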
It’s not easy to prepare the failover list unless you have already been in that situation. It takes great vision, or training, to foresee the single massive crisis that could bring down your entire system. For us at D-Wave, failover testing is part of our production release checklist for new solvers, and every failure is an opportunity to make that checklist better.
3. No planning for the shortage of natural (fossil) resources: As The Atlantic put it, the natural aspect of this disaster had a precedent: although Texas saw brutal temperatures this week, they were within the historical norm.
When planning contingencies for failover:
- Look for the precedent levels that could trigger a failover
- Have an independent authority review the failover plan. We become complacent over time, and complacency leads to the worst failover planning. A “we have always done it that way” mindset also curbs the spirit to challenge the failover plan.
- If it is not possible to arrange secondary failover components, at least have a document listing plan A, plan B, and plan C (and so on, until all the possible options are exhausted)
- Have a good monitoring system with tight threshold alarms. For example, start sending monitoring notifications when the system is at 50% capacity, start preparing the secondary backup at 65% capacity, and have the secondary system ready to take over any time it is over 75% (a minimal sketch of this tiered logic follows the list). The threshold levels can be tuned to the behaviour or usage of the system.
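As an illustration only, the tiered thresholds above might be encoded like this. The function name and the way utilisation is computed are assumptions for the sketch; the numbers are simply the ones from the bullet and should be tuned per system.

```python
# Tiered capacity thresholds (example values from the text; tune per system).
NOTIFY_AT = 0.50    # start sending monitoring notifications
PREPARE_AT = 0.65   # start warming up the secondary backup
TAKEOVER_AT = 0.75  # secondary must be ready to take over at any moment


def capacity_action(used: float, total: float) -> str:
    """Map current utilisation to the action the on-call team should take."""
    utilisation = used / total
    if utilisation >= TAKEOVER_AT:
        return "secondary-ready-to-take-over"
    if utilisation >= PREPARE_AT:
        return "prepare-secondary"
    if utilisation >= NOTIFY_AT:
        return "notify"
    return "ok"


if __name__ == "__main__":
    # Example: 7.8 of 10 capacity units used -> 78% -> secondary should be ready.
    print(capacity_action(7.8, 10.0))
```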
For something as critical as the power grid, modern society’s life support system, the lesson carries over directly: identify the critical components required for your production apps to function and rank them. Test the life out of those components using unit, functional (integration), performance, load, and stress testing. Create a failover plan and list the failure trigger conditions. Trigger those conditions and test the failover to the secondary (backup) systems. Monitor the failure thresholds and bring up the secondary systems often (as a dry run).
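A periodic dry run can be scripted. The sketch below is a hypothetical outline; the component names and the callables that inject a failure and check the backup are placeholders you would wire up to your own systems.

```python
# Hypothetical failover dry run over ranked critical components.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CriticalComponent:
    name: str
    rank: int                              # 1 = most critical
    trigger_failure: Callable[[], None]    # inject the failure condition
    secondary_healthy: Callable[[], bool]  # verify the backup took over


def failover_dry_run(components: List[CriticalComponent]) -> Dict[str, bool]:
    """Trigger each component's failure condition, most critical first,
    and record whether its secondary held up."""
    results = {}
    for component in sorted(components, key=lambda c: c.rank):
        component.trigger_failure()
        results[component.name] = component.secondary_healthy()
    return results
```

Running this on a schedule, against a staging copy of production, keeps the failover path exercised instead of letting you discover it is broken during the real storm.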
I hope Texans get their lives back to normal soon, and that we can all apply the lessons from this power grid failure to avoid future (un)natural disasters.