Embedded software error handling is something that teams often struggle with. It’s very tempting to assume that a system will behave as perfectly in the field as it does on the engineering bench. However, embedded software is written under the best of conditions during development, and the developer works from their own concept of how the system is supposed to work. Things usually work quite smoothly, but as thousands of devices get into the hands of users, it becomes statistically likely that the unexpected will happen and errors will occur. In today’s post, let’s explore several strategies developers can use to write software that handles unexpected errors.
Strategy #1 – Constantly consider what can go wrong
The first strategy developers need to deploy to handle errors is to actively question what can go wrong as they write every single line of code. For example, the moment I write the implementation for a function like:
void Dio_WriteChannel(DioChannel_t Channel, bool state)
{
    // Additional code goes here
}
I’m asking myself several questions:
- What happens if the Channel parameter goes out of range?
- Should the function return an error code or success flag?
- How do I validate that the desired channel state has changed?
- What do I do if the state tries to change but can’t?
- Can memory become corrupted such that my bool state variable is something other than true or false? If so, how do I handle that?
- Is an assert enough to check boundary conditions at development time, or should there be run-time checks on the parameters?
That’s a lot of questions for such a simple and common code block. The details of the function can result in far more complex questions! If you want to handle errors, though, you have to be constantly questioning the code and what could go wrong.
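To make a few of those questions concrete, here is a minimal sketch of one way they might be answered. Everything in it is an assumption for illustration: the `DioChannel_t` enumerators, the `DioStatus_t` type, the `Dio_WriteChannelChecked` name, and the `Dio_SimulatedPort` variable that stands in for a real output register are all hypothetical. It returns a status code instead of `void`, checks the channel range at run time, and reads the port back to confirm the write. It does not attempt to detect a corrupted `bool`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical channel list; the real type would come from the driver. */
typedef enum {
    DIO_CHANNEL_0,
    DIO_CHANNEL_1,
    DIO_CHANNEL_2,
    DIO_CHANNEL_3,
    DIO_CHANNEL_COUNT
} DioChannel_t;

/* Hypothetical status codes returned to the caller. */
typedef enum {
    DIO_OK,
    DIO_ERR_CHANNEL_RANGE,
    DIO_ERR_READBACK
} DioStatus_t;

/* Simulated 32-bit output register standing in for real hardware. */
static volatile uint32_t Dio_SimulatedPort = 0u;

DioStatus_t Dio_WriteChannelChecked(DioChannel_t Channel, bool State)
{
    /* Development-time check of an internal invariant: every channel must
       fit in the 32-bit port. Compiled out when NDEBUG is defined. */
    assert(DIO_CHANNEL_COUNT <= 32);

    /* Run-time check on the caller's parameter: stays active in release
       builds, unlike the assert above. */
    if (((int)Channel < 0) || (Channel >= DIO_CHANNEL_COUNT)) {
        return DIO_ERR_CHANNEL_RANGE;
    }

    const uint32_t mask = (uint32_t)1u << (uint32_t)Channel;

    if (State) {
        Dio_SimulatedPort |= mask;
    } else {
        Dio_SimulatedPort &= ~mask;
    }

    /* Read the port back to confirm that the requested state took effect. */
    const bool actual = (Dio_SimulatedPort & mask) != 0u;
    return (actual == State) ? DIO_OK : DIO_ERR_READBACK;
}
```

A caller can then decide what to do with each status, for example retrying or entering a safe state when `DIO_ERR_READBACK` comes back. On real hardware the read-back would target the pin or output data register rather than a variable.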
Strategy #2 – Document your concerns and questions using TODO
As software is developed, there are sometimes more questions than answers. In the above example, there may not yet be an answer for how errors will be returned and handled. It would be easy just to let it be for now, but with other fires undoubtedly cropping up, the question would be forgotten in the noise.
As I’m writing my software, one technique I use, which is probably more tactical than strategic, is to sprinkle my concerns or questions in the code comments. Most modern IDEs recognize custom tags, such as TODO, that can be pulled from the code to create a list. These show up as informational messages. If there is an error that needs to be handled, but I’m not sure how to do it at that point, I will use TODO. If there is an implementation that I want to review, I will probably use a TODO, or maybe some other keyword that I can easily search the code for. Some care needs to be taken not to overload the TODO informational messages, or the list becomes too noisy, but we also want to make sure we don’t lose our question or issue. (Yes, external trackers can be used, but I find it’s far easier to keep the note with the code so it can be seen easily in code reviews and by other developers.)
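As a small illustration of how these tags might look in practice, consider the sketch below. The function, the `DIO_CHANNEL_MAX` limit, and the simulated input register are hypothetical and exist only to give the comments somewhere to live. `REVIEW` is just one example of a second keyword; whether an IDE lists it automatically depends on how its task tags are configured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIO_CHANNEL_MAX 8u  /* Hypothetical channel count for this sketch. */

/* Stand-in for an input register; bits 0 and 2 are set. */
static uint32_t Dio_SimulatedInputs = 0x05u;

bool Dio_ReadChannel(uint8_t Channel, bool *State)
{
    /* TODO: Decide whether a NULL pointer should assert, return an error
       code, or both. Returning false for now. */
    if (State == NULL) {
        return false;
    }

    /* TODO: Confirm the valid channel range once the pinout is final. */
    if (Channel >= DIO_CHANNEL_MAX) {
        return false;
    }

    /* REVIEW: No debouncing yet. Is a single sample acceptable here? */
    *State = ((Dio_SimulatedInputs >> Channel) & 1u) != 0u;
    return true;
}
```

Each open question sits next to the line it concerns, so it travels with the code into reviews, and a plain text search for `TODO` or `REVIEW` recovers the full list even without IDE support.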
Strategy #3 – Lose the “I’ll go back later” attitude
I can’t tell you how many times I’ve been told, “We know this isn’t the right way to do this, but we will go back later and fix it!” I’ve heard this from entry-level engineers and from engineers with decades of experience who should know better. There is no better time than the present to fix something, document it, or implement error checking. There is always a fire or some issue fighting for developer attention, and while we always have the best intentions to go back and add that error handling, it’s never going to happen!
As soon as something appears to management to be working, it’s time to move on to the next pressing issue. If it’s working, why would you invest more time in it for diminishing returns? Management doesn’t realize that you didn’t include error checking or that there are gaping holes in the implementation! If the product needs robustness, don’t try to add it later or believe that you can go back later and fix it. Instead, do what needs to be done while you write the code, and then you can sleep better at night knowing there isn’t a hidden error waiting to ruin your week.
The way a developer approaches writing their software determines whether their system will recover from errors gracefully or whether it will metaphorically blow up in their users’ faces. The key is having a development attitude that considers what can go wrong and implements the recovery mechanism while the software is being written. I often hear teams say they will get things working first and go back later to clean up and handle errors. Unfortunately, that rarely happens, and the result is a deployed disaster that is just waiting to happen. To better handle errors, deal with what can go wrong in the moment, because otherwise it will never be handled.