3 Strategies for Handling Errors in Embedded Software

It’s very tempting to assume that a system is going to behave as perfectly in the field as it does on the engineering bench. During development, embedded software is written under the best of conditions. The developer knows, or at least has their own concept of, how the system is supposed to work. Things usually work quite smoothly, but as thousands of devices get into the hands of users, it becomes statistically likely that the unexpected will happen and errors will occur. In today’s post, let’s explore the strategies that developers need in order to write software that can handle unexpected errors.

Strategy #1 – Constantly consider what can go wrong

The first strategy that developers need to deploy to handle errors is to actively question what can go wrong as they write every single line of code. For example, the moment I write the implementation for a function like:

void Dio_WriteChannel(DioChannel_t Channel, bool state)
{
    // Additional code goes here
}

I’m asking myself several questions:

What happens if the Channel parameter goes out of range?
Should the function be returning an error code or success flag?
How do I validate that the desired channel state has changed?
What do I do if the state tried to change but can’t?
Can memory become corrupted such that my bool state variable is something other than true or false? If so, how do I handle that?
Is an assert enough to check boundary conditions at development time or should there be real-time checks on the parameters as well?

That’s a lot of questions for such a simple and common code block, and we really haven’t even started to fill in the details yet! If you want to be able to handle errors, though, you have to constantly question the code and what could go wrong.
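To make a few of those questions concrete, here is a minimal sketch of one way they might be answered. It is not a definitive implementation: the DioStatus_t codes, the channel enumeration, and the Dio_PortReg variable (a stand-in for a real port register) are all assumptions made for illustration, and the return type has been changed from void so that the caller can see a failure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed channel list; a real driver would take this from its configuration. */
typedef enum
{
    DIO_CHANNEL_0,
    DIO_CHANNEL_1,
    DIO_CHANNEL_2,
    DIO_CHANNEL_3,
    DIO_CHANNEL_MAX   /* Number of channels, not a valid channel itself */
} DioChannel_t;

/* Hypothetical status codes answering "should the function return an error code?" */
typedef enum
{
    DIO_OK,
    DIO_ERR_CHANNEL_RANGE,   /* Channel parameter was out of range */
    DIO_ERR_VERIFY           /* Read-back did not match the requested state */
} DioStatus_t;

/* Stand-in for a memory-mapped port register so the sketch runs on a host. */
static volatile uint32_t Dio_PortReg = 0U;

DioStatus_t Dio_WriteChannel(DioChannel_t Channel, bool state)
{
    /* Run-time range check: this stays in release builds, unlike an assert. */
    if ((unsigned int)Channel >= (unsigned int)DIO_CHANNEL_MAX)
    {
        return DIO_ERR_CHANNEL_RANGE;
    }

    const uint32_t mask = (uint32_t)1U << (unsigned int)Channel;

    /* Testing state as a condition treats any non-zero value as "set". */
    if (state)
    {
        Dio_PortReg |= mask;
    }
    else
    {
        Dio_PortReg &= ~mask;
    }

    /* Read the register back to validate that the channel actually changed. */
    const bool readback = ((Dio_PortReg & mask) != 0U);
    if (readback != state)
    {
        return DIO_ERR_VERIFY;
    }

    return DIO_OK;
}
```

A caller can then check the result, for example by comparing Dio_WriteChannel(DIO_CHANNEL_1, true) against DIO_OK, and decide what recovery makes sense for the product. Whether a read-back of the output register is a meaningful check depends on the hardware, so treat that part as a placeholder for whatever verification the target supports.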

Strategy #2 – Document your concerns and questions using TODO

As software is developed, there are sometimes more questions than there are answers. In the above example, there may not be an answer yet for how returned errors are going to be handled. It would be really easy to just let it be for now, but with other fires undoubtedly cropping up, the question would soon be forgotten in the noise.

As I’m writing my software, one technique I use, which is probably more tactical than strategic, is to sprinkle my concerns or questions in the code comments. Most modern IDEs recognize comment tags such as TODO and can pull them from the code to create a task list. These show up as informational messages. If there is an error that needs to be handled but I’m not sure how to do it at that point, I will use TODO. If there is an implementation that I want to review, I will probably use a TODO as well, or maybe some other keyword that I can easily search the code for. Some care needs to be taken not to overload the TODO informational messages, otherwise the list becomes too noisy, but we also want to make sure we don’t lose our question or issue. (Yes, external trackers can be used, but I find it’s far easier to keep the note with the code so it can be seen easily in code reviews and by other developers.)
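As an illustration, tagged comments in the earlier function might look something like the sketch below. The REVIEW keyword, the channel count, and the Dio_Shadow array are assumptions made for the example; use whatever tags your IDE or search tooling is set up to recognize.

```c
#include <stdbool.h>

/* Assumed types and sizes for illustration only. */
typedef unsigned int DioChannel_t;
#define DIO_CHANNEL_COUNT (8U)

/* Stand-in for the hardware output state so the sketch runs on a host. */
static bool Dio_Shadow[DIO_CHANNEL_COUNT];

void Dio_WriteChannel(DioChannel_t Channel, bool state)
{
    /* TODO: Decide how an out-of-range Channel is reported to the caller.
     *       For now the write is silently ignored. */
    if (Channel >= DIO_CHANNEL_COUNT)
    {
        return;
    }

    /* REVIEW: Confirm that a shadow copy is an acceptable way to track
     *         the channel state before this is used on real hardware. */
    Dio_Shadow[Channel] = state;
}
```

Because the tags are plain text in comments, a simple text search for TODO or REVIEW will find them even when the IDE’s task list is not available.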

Strategy #3 – Lose the “I’ll go back later” attitude

I can’t tell you how many times I’ve been told, “We know this isn’t the right way to do this, but we will go back later and fix it!” I’ve heard this from entry-level engineers and from engineers with decades of experience who should know better. There is no better time than the present to fix something, document it, or implement error checking. There is always a fire or some issue fighting for developer attention, and while we always have the best intentions to go back and add that error handling, it’s never going to happen!

As soon as something appears to work to management, it’s time to move on to the next pressing issue. If it’s working, why would you invest more time in it for diminishing returns? Management doesn’t realize that you didn’t include error checking or that there are gaping holes in the implementation! If the product needs robustness, don’t try to add it later or believe that you can go back later and fix it. Do what needs to be done while you write the code and then you can sleep better at night knowing there isn’t an error in hiding waiting to ruin your week.

Conclusions

The way that a developer approaches writing their software is what determines whether their system will recover from errors gracefully or whether it will metaphorically blow up in their users’ faces. The key is having the right development attitude: one that considers what can go wrong and implements the recovery mechanism while the software is being written. I often hear teams say they will make it work and go back later to clean it up and handle errors. It rarely happens, and the result is a deployed disaster that is just waiting to happen. In order to get a better handle on errors, deal with what can go wrong in the moment, because otherwise it will never be handled.