On September 17, 1908, Orville Wright and Lt. Thomas Selfridge took off in a Wright Flyer from Fort Myer, Virginia. Shortly after take-off, the Wright Flyer suddenly pitched down and drove into the ground, injuring Wright and killing Selfridge. The crash occurred when one of the wooden propellers split and pulled on the bracing wires, which caused the rear rudder to move from the vertical position to the horizontal position [1]. It was the first airplane crash to result in a death. Fast forward approximately 110 years: planes are no longer the simple mechanical aircraft flown by the Wrights and other early aviation pioneers, but highly sophisticated electronic systems driven by millions of lines of software. The advancements of the last century have made air travel the safest mode of transportation.
Recently the news headlines have been dominated by two crashes of Boeing's new 737 MAX aircraft, which occurred under similar circumstances within six months of each other. The fallout from these disasters may only be starting: aircraft around the world have been grounded, production of the 737 MAX has been cut back, and March sales of the aircraft dropped to zero. Boeing's reputation as a safety leader has also been damaged as investigations have been opened into how the system at the center of the accidents, MCAS, was developed and certified.
It will take accident investigators quite some time to fully establish the sequence of events that led to the loss of these aircraft and its causes. However, with the information that has already been released, embedded systems companies and developers can look at the fiasco Boeing is currently going through and learn, or be reminded of, several general lessons that apply to their own industries and products. Let's examine those lessons.
Lesson #1 – Don’t compromise your product to save or make money short-term
There is a normative pressure on businesses and developers today to increase revenue, reduce costs and ship products as fast as possible. The mantra isn't quality. It isn't security. It isn't user-friendliness. The mantra is maximum short-term growth, and in my opinion, at any cost, as long as short-term growth is maximized. Now, I don't believe this was Boeing's mantra or even its intent, but given the pressure the company appeared to be under from customers and shareholders to deliver an aircraft that could compete with the Airbus A320neo, we can see how it may have started to cave to this normative pressure.
This brings us to our first lesson: don't risk compromising your product to save or make more money. It's important to be successful in the short term, but there is more to every business than how much revenue was generated this quarter and next. Even when the competition releases a competitive product and clients apply pressure, it's important to keep the long-term narrative in mind and not sacrifice quality or reputation, or put clients' businesses in jeopardy.
Lesson #2 – Identify and mitigate single points of failure
In any embedded system being developed, it's important to understand the potential failure modes of the system, what effect those failures will have, and how they can be mitigated. There are many ways that teams go about this, including performing a Design Failure Mode and Effects Analysis (DFMEA), which analyzes design functions, failure modes and their effects on the customer or user. Once such an analysis is done, we can determine how to mitigate the effect of each failure.
In systems that can affect the safety of a user, it's common practice to avoid single points of failure such as a faulty sensor or a single input. Obviously, if a single input suddenly provides junk data, only God knows how the system will respond, and when you throw in Murphy's law, the results are not going to be positive. I was quite literally taken aback when I read that the MCAS system relied on a single sensor for decision making. Having worked on safety-critical and robust embedded systems in the past, I find it mind-boggling that the use of a single sensor input would be considered acceptable, and adding the input from a second sensor that then disables the system if a sensor fails doesn't seem to make things much better [2] (though that really depends on engineering philosophy and culture).
Lesson #3 – Don’t assume your user can handle it
An interesting lesson that I think many engineers can take from the fiasco is that we can't assume or rely on our users to properly operate our devices, especially if those devices are meant to operate autonomously. I don't say that to be derogatory, but to point out that complex systems require more time to analyze and troubleshoot. It seems that Boeing assumed that if an issue arose, the user had enough training and experience, and knew the existing procedures well enough, to compensate. Right or wrong, as designers we may need to adopt "lowered expectations" and do everything we can to protect users from themselves.
Lesson #4 – Highly tested and certified systems have defects
Edsger Dijkstra wrote that "Program testing can be used to show the presence of bugs, but never to show their absence." Since we can't show that a system has no bugs, we have to assume that even our highly tested and certified systems have defects. This should change the way every developer thinks about writing software. Instead of trying to expose defects on a case-by-case basis, we should be developing defect strategies that can detect when the system is not behaving properly or when something does not seem normal with its inputs. By doing this, we can test as many defects out of the system as possible, but when a new one arises in the field, a generic defect-detection mechanism will hopefully notice that something is amiss and take corrective action.
Lesson #5 – Sensors and systems fail
The fact that sensors and systems fail should seem like an obvious statement, but I see quite a few developers who write software as if their microcontroller will never lock up, encounter a single-event upset or have corrupted memory. Sensors will freeze, processors will lock up, and garbage in will produce garbage out. As developers we need to assume that things will go wrong and write code to handle those cases, rather than assuming the system will always work as well in the field as it does on our lab benches. If you design your system with the expectation that it will fail, you'll end up with a robust system that has to do a lot of hard work before it finally finds a way to fail (if it ever does).
While it will be months before we have the full reports on what caused the 737 MAX crashes, and the results of the congressional hearings into how the aircraft was certified and developed, we don't have to wait for those results to draw lessons. We've examined several important reminders that all businesses and developers need to consider carefully to make sure they are not treading down a similar path with their own systems. The question you should now be asking is: what compromises are you making today, and what actions will you take to make sure they don't result in your own fiasco tomorrow?
I agree with your analysis; for me, lesson #1 is the big problem right now.
As an aerospace embedded systems designer, it is difficult for me to imagine that such a system would be designed, tested and produced by a well-respected, long-time manufacturer. There are situations and forces at play here that caused a multi-level system breakdown, from independent certification and safety personnel to in-house system design and system verification personnel. It is hard to fathom the series of events that must have occurred to put Boeing in the position of explaining this fiasco.