Embedded systems are subject to two very different kinds of failure: hardware and software. The probability of a hardware failure follows a well-known and well-understood curve, shown in Figure 1. When an electronic device is first manufactured, there is a short period during which the probability of failure is high. Many manufacturers operate the hardware during this initial period in order to burn in the new device. Only devices that survive burn-in are shipped and enter the useful life period of the product. Failed devices may be reworked and eventually deployed once they pass burn-in.
During its useful life, a hardware product has a constant and, ideally, low probability of failure until it enters the wear-out period. In the wear-out period, the probability of a hardware failure rises as physical parts, both passive and active components, reach the end of their ratings and expected lifetimes.
Figure 1 – Hardware failure rate probability
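The shape of this curve can be approximated numerically. The sketch below is one common way to do it, not necessarily how Figure 1 was produced: it adds a falling Weibull hazard for burn-in, a constant rate for useful life, and a rising Weibull hazard for wear-out. The function names and every constant are assumptions chosen only to give a plausible shape.

```python
def weibull_hazard(t_hours: float, shape: float, scale: float) -> float:
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t_hours / scale) ** (shape - 1)


def hardware_failure_rate(t_hours: float) -> float:
    """Illustrative failures per hour at age t_hours (all constants are made up)."""
    burn_in = weibull_hazard(t_hours, shape=0.5, scale=1e8)   # shape < 1: falls with age
    useful_life = 2e-6                                         # constant random failures
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=6e4)  # shape > 1: rises with age
    return burn_in + useful_life + wear_out


# Sample the curve early in life, mid-life, and late in life.
for age in (10, 10_000, 80_000):
    print(f"{age:>6} h: {hardware_failure_rate(age):.2e} failures/hour")
```

With these assumed parameters the rate is elevated at 10 hours, lowest around 10,000 hours, and climbs again by 80,000 hours, which is the three-phase behavior described above.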
The hardware failure rate probability is relatively flat once the hardware enters the useful life stage. Software, on the other hand, varies drastically, as can be seen in Figure 2. Software starts out with a high probability of failure in the Test/Debug phase, which is equivalent to the hardware burn-in phase. As the software is tested, the probability of a software failure decreases until it reaches a level acceptable for the product to ship and enter its useful life.
Figure 2 – Software failure rate probability
Unlike hardware, software can be updated at any time in the field. Every firmware update to a device causes a spike in the probability that a failure will occur in the system. There are many reasons for a spike in the failure rate after an update, such as:
- Failure to successfully complete the update
- New features not being fully tested
- Regression testing not being completed
- Different states of the devices in the field
- Security vulnerabilities
- Not spending enough time testing
The list could go on, but the most prevalent reason is most likely a failure to fully test the software update. A great recent example is the Nest Thermostat update that caused over 100,000 Nest devices in the U.S. to stop working in January. The firmware update drained the device's battery and left owners without a working thermostat overnight in the middle of winter.
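To make the update spikes concrete, here is a toy model of the software curve: a failure rate that decays during test and debug, plus a spike at each field update that decays as the new problems are found and fixed. It is only a sketch of the shape. The function name, the exponential form, and all parameter values are assumptions, not measurements.

```python
import math


def software_failure_rate(t_days, update_days, floor=0.02, initial=0.5,
                          debug_decay=0.1, spike=0.2, spike_decay=0.3):
    """Toy failure rate at day t_days, given the days on which updates shipped."""
    # Test/Debug phase: starts high and decays toward an acceptable floor.
    rate = floor + initial * math.exp(-debug_decay * t_days)
    # Each update that has already shipped adds a spike that fades over time.
    for day in update_days:
        if t_days >= day:
            rate += spike * math.exp(-spike_decay * (t_days - day))
    return rate


updates = [100, 200]  # hypothetical update days
for day in (0, 99, 100, 130, 200):
    print(day, round(software_failure_rate(day, updates), 4))
```

In this model, smaller updates correspond to a smaller `spike` value, and more thorough testing before release corresponds to a faster `spike_decay` or a lower `spike`.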
Software is complex and continuously evolves. Developers and managers often neglect to account for the probability of software failure, updating their devices frequently and with minimal testing. Chances are that as soon as you update a product, bugs or other unexpected issues will crop up. Developers need to be diligent in performing regression testing and must make certain that a software update is ready to be deployed. Below are a few thoughts on what developers can do to help lower the failure rate probability curve.
- Make small incremental updates
- Perform full regression testing
- Run updates on more than a single device, preferably at least one from each manufactured batch
- Run the updated devices for more than 72 hours to verify continued correct operation
- Use an automated test system to continuously operate and control the device during testing
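Two of the recommendations above, testing one device from each batch and running for more than 72 hours under automated control, can be sketched as a small harness. This is a minimal illustration under stated assumptions: `Device`, `pick_one_per_batch`, `soak_test`, and the `check_health` callback are hypothetical names, and the 96-hour default is an arbitrary choice above the 72-hour minimum.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Device:
    serial: str
    batch: str  # manufacturing batch identifier (hypothetical field)


def pick_one_per_batch(devices: Iterable[Device]) -> List[Device]:
    """Return the first device seen from each manufactured batch."""
    chosen = {}
    for device in devices:
        chosen.setdefault(device.batch, device)
    return list(chosen.values())


def soak_test(device: Device,
              check_health: Callable[[Device], bool],
              hours: int = 96,              # assumed default, longer than 72 hours
              checks_per_hour: int = 1,
              wait: Callable[[float], None] = time.sleep) -> bool:
    """Poll the device for the whole soak period; fail on the first bad check."""
    interval_s = 3600 / checks_per_hour
    for _ in range(hours * checks_per_hour):
        if not check_health(device):
            return False
        wait(interval_s)
    return True
```

A real `check_health` would exercise the updated device over whatever interface the test rig provides. Passing `wait` as a parameter lets the harness itself be tested without actually sleeping for days.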