Watchdogs are an important part of an embedded system.
A number of years ago I performed a brief survey of the state of watchdogs with small satellites in mind. One of the primary requirements of watchdog design for small satellites is that the system has to be highly reliable. Few space systems can be serviced in flight; the Hubble Space Telescope is a rare exception. The great part about designing a highly reliable watchdog is that the best practices don’t care what industry the watchdog is used in, whether you are designing a spacecraft, a medical device or a toaster oven.
The push towards “faster, better, cheaper” missions with an emphasis on design reuse has led to the popularity of commercial off-the-shelf (COTS) products in missions that have traditionally relied upon custom software and hardware. The motivation for moving towards COTS hardware is access to faster, cheaper and more powerful processors with lower power consumption than those available in rad-hard versions. While moving towards COTS components has many advantages, it also requires developing mechanisms to protect the components from the harsh radiation environments to which high-altitude and space systems are subject.
There are two primary methods to deal with the effects of radiation on an embedded system. The first, fault avoidance, can be defined as using more robust or radiation-hardened components to avoid single event upsets caused by cosmic rays and other high-energy particles. While this approach improves system reliability, it carries several disadvantages. Primarily, fault avoidance increases system power usage, costs and lead times while decreasing computational power. Power requirements force a trade-off between higher-voltage components and faster system clocks: power consumption rises as the component voltage increases and as the system clock frequency increases. When the selected components are not in high demand in markets such as the consumer electronics industry, the cost of the components rises due to the lack of demand, which also increases their lead time.
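The voltage and clock dependence described above matches the standard first-order model of dynamic power in CMOS logic. The symbols are the usual textbook notation rather than anything taken from the paper: $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency. The model ignores static leakage.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

Under this model, halving the clock frequency halves dynamic power, while halving the supply voltage cuts it to a quarter.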
The alternative method to fault avoidance is fault detection. Even the most robust fault avoidance systems will eventually have a fault. While fault avoidance systems aim to increase the mean time to failure beyond the mission life, fault detection methods detect the fault when it occurs in order to handle it and limit the system downtime. Fault-tolerant systems built this way have a number of advantages, such as low power consumption, increased computational power and decreased hardware costs. However, these advantages can come at the cost of increased hardware complexity (in order to protect the components) and increased software complexity, depending on the type of fault-tolerant architecture used.
When developing a system based on the fault detection method, one of the most cost-effective methods of detecting and handling faults is the watchdog. A watchdog is a subsystem that monitors the operation of the system. In the event that a fault or an unknown state occurs, the watchdog can restart the system or put it into a known state from which it can recover. In this paper, we will examine common watchdog architectures that have been used in terrestrial and space systems, develop a systematic approach to selecting the most effective architecture, and develop a general strategy for using watchdogs to improve system reliability in CubeSats. We will then examine how this approach was used to develop a fault detection method for the Radio Aurora Explorer (RAX) nanosatellite.
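The watchdog behavior described above can be illustrated with a small simulation. This is a minimal sketch and not the architecture from the paper: the `WatchdogTimer` class, the `run_demo` function, the tick-based timing and the "hang" fault model are all hypothetical names and assumptions chosen for illustration. Real watchdogs are usually hardware timers; Python is used here only to show the logic.

```python
class WatchdogTimer:
    """Simulated countdown watchdog (hypothetical, for illustration only)."""

    def __init__(self, timeout_ticks, on_timeout):
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks
        self.on_timeout = on_timeout      # recovery action, e.g. a system reset
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self):
        """Called by healthy application code to restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one time unit; run the recovery action if the count expires."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.on_timeout()
            # The restarted system begins with a fresh countdown.
            self.remaining = self.timeout_ticks


def run_demo(total_ticks=10, hang_at=4, timeout_ticks=3):
    """Toy main loop that stops kicking the watchdog at tick `hang_at`."""
    events = []
    state = {"hung": False}

    def reset_system():
        # Assumed recovery: return to a known state, clearing the hang.
        state["hung"] = False
        events.append("reset")

    wdt = WatchdogTimer(timeout_ticks, reset_system)
    for t in range(total_ticks):
        if t == hang_at:
            state["hung"] = True          # simulated fault, e.g. a stuck loop
        if state["hung"]:
            events.append("hung")         # faulty code never reaches kick()
        else:
            wdt.kick()
            events.append("kick")
        wdt.tick()                        # time passes regardless of the fault
    return events, wdt.resets


if __name__ == "__main__":
    events, resets = run_demo()
    print(events, resets)
```

With the default arguments, the simulated loop hangs at tick 4, the watchdog fires once after three ticks pass without a kick, and the loop resumes kicking afterwards. The key design point is that the countdown advances independently of the application, so a hung application cannot prevent its own recovery.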