3 Strategies for Embedded Software Error Handling

Embedded software error handling is something that teams often struggle with. It’s very tempting to assume that a system will behave as perfectly in the field as it does on the engineering bench. However, embedded software is written under the best of conditions during development. The developer knows their own concept of how the system is supposed to work. Things usually work quite smoothly, but as thousands of devices get into the hands of users, it becomes statistically likely that the unexpected will happen and errors will occur. In today’s post, let’s explore several strategies developers need in order to write software that can handle unexpected errors.

Strategy #1 – Constantly consider what can go wrong

The first strategy developers need to deploy to handle errors is to actively question what can go wrong as they write every single line of code. For example, the moment I write the implementation for a function like:

void Dio_WriteChannel(DioChannel_t Channel, bool state)
{
    // Additional code goes here
}

I’m asking myself several questions:

  • What happens if the Channel parameter goes out of range?
  • Should the function return an error code or success flag?
  • How do I validate that the desired channel state has changed?
  • What do I do if the state tries to change but can’t?
  • Can memory become corrupted such that my bool state variable is something other than true or false? If so, how do I handle that?
  • Is an assert enough to check boundary conditions at development time, or should there be run-time checks on the parameters?

That’s a lot of questions for such a simple and common code block. The details of the function can result in far more complex questions! If you want to handle errors, though, you have to be constantly questioning the code and what could go wrong.
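To make a few of those questions concrete, here is a minimal sketch of one way they might be answered. It is not the implementation behind the function above. The name Dio_WriteChannelChecked, the DioStatus_t codes, the three-channel DioChannel_t definition, and the simulated register (SimPortRegister, SimStuckMask) are all assumptions invented for illustration; on real hardware the read-back would go through the port's input or output data register. The sketch uses a run-time check for the caller-supplied channel, an assert for an internal invariant, and a read-back to confirm that the state actually changed. It does not attempt the corrupted-bool question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical channel list; a real project would define its own. */
typedef enum { DIO_CHANNEL_0, DIO_CHANNEL_1, DIO_CHANNEL_2, DIO_CHANNEL_MAX } DioChannel_t;

/* Hypothetical status codes returned instead of void. */
typedef enum { DIO_OK = 0, DIO_ERR_CHANNEL, DIO_ERR_READBACK } DioStatus_t;

/* Simulated port register standing in for real hardware (assumption). */
static volatile uint32_t SimPortRegister = 0u;
/* Bits set here behave like pins that refuse to change (assumption). */
static uint32_t SimStuckMask = 0u;

/* Simulated hardware write: stuck pins keep their previous value. */
static void Sim_WritePin(uint32_t mask, bool state)
{
    uint32_t writable = mask & ~SimStuckMask;
    if (state) {
        SimPortRegister |= writable;
    } else {
        SimPortRegister &= ~writable;
    }
}

DioStatus_t Dio_WriteChannelChecked(DioChannel_t Channel, bool state)
{
    /* Run-time check: the channel comes from the caller, so it is
       validated in production builds as well as in development. */
    if ((unsigned)Channel >= (unsigned)DIO_CHANNEL_MAX) {
        return DIO_ERR_CHANNEL;
    }

    uint32_t mask = (uint32_t)1u << (unsigned)Channel;

    /* Development-time check of an internal invariant: a validated
       channel must always map to a non-zero bit mask. */
    assert(mask != 0u);

    Sim_WritePin(mask, state);

    /* Read the pin back to confirm the requested state took effect. */
    bool actual = (SimPortRegister & mask) != 0u;
    if (actual != state) {
        return DIO_ERR_READBACK;
    }
    return DIO_OK;
}
```

Returning a status code pushes the next question onto the caller: what should the application do when it receives DIO_ERR_READBACK? That decision still has to be made, but at least the failure is no longer silent.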

Strategy #2 – Document your concerns and questions using TODO

As software is developed, there are sometimes more questions than there are answers at the moment. In the above example, there may not be an answer yet for how errors will be returned and handled. It would be easy just to let it be for now, but with other fires undoubtedly cropping up, the question would be forgotten in the noise.

As I’m writing my software, one technique I use, which is probably more tactical than strategic, is to sprinkle my concerns or questions in the code comments. Most modern IDEs have custom tags, such as TODO, that can be pulled from the code to create a list. These show up as informational messages. If there is an error that needs to be handled, but I’m not sure how to do it at that point, I will use TODO. If there is an implementation that I want to review, I will probably use a TODO, or maybe some other keyword that I can easily search the code for. Some care needs to be taken not to overload the TODO informational messages; otherwise, the list becomes too noisy. At the same time, we want to make sure we don’t lose our question or issue. (Yes, external trackers can be used, but I find it’s far easier to keep the note with the code, where it can be seen easily in code reviews and by other developers.)
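As a small illustration of what such comments might look like, here is a sketch with one TODO for an open error-handling question and a second tag for code that works but deserves a second look. The REVIEW tag, the Adc_ function names, and the stubbed conversion check are hypothetical and exist only for this example; use whatever keywords your IDE or search tooling is set up to collect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ADC_OK = 0, ADC_ERR_TIMEOUT } AdcStatus_t;

/* Hypothetical stub: pretends the conversion completes on the
   fourth poll (attempt index 3). Real code would read a status flag. */
static bool Adc_ConversionDone(uint32_t attempt)
{
    return attempt >= 3u;
}

AdcStatus_t Adc_WaitForConversion(uint32_t maxAttempts)
{
    /* Passing zero attempts is a programming mistake, so it is
       caught with a development-time assert. */
    assert(maxAttempts > 0u);

    /* REVIEW: Polling works on the bench, but is a busy-wait
       acceptable here, or should this be interrupt driven? */
    for (uint32_t attempt = 0u; attempt < maxAttempts; attempt++) {
        if (Adc_ConversionDone(attempt)) {
            return ADC_OK;
        }
    }

    /* TODO: Decide how a timeout is reported to the application.
       Retry? Log a fault? Enter a safe state? */
    return ADC_ERR_TIMEOUT;
}
```

The function already returns a distinct timeout code, so the open question is captured right next to the line it concerns instead of living only in someone's memory.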

Strategy #3 – Lose the “I’ll go back later” attitude

I can’t tell you how many times I’ve been told, “We know this isn’t the right way to do this, but we will go back later and fix it!” I’ve heard this from entry-level engineers and from engineers with decades of experience who should know better. There is no better time than the present to fix something, document it, or implement error checking. There is always a fire or some issue fighting for developer attention, and while we always have the best intentions to go back and add that error handling, it’s never going to happen!

As soon as something appears to work for management, it’s time to move on to the next pressing issue. If it’s working, why would you invest more time in it for diminishing returns? Management doesn’t realize that you didn’t include error checking or that there are gaping holes in the implementation! If the product needs robustness, don’t try to add it later or believe that you can go back later and fix it. Instead, do what needs to be done while you write the code, and then you can sleep better at night knowing there isn’t an error in hiding waiting to ruin your week.

Conclusions

The way a developer approaches writing their software determines whether their system will recover from errors gracefully or whether it will metaphorically blow up in their users’ faces. The key is having the right development attitude: one that considers what can go wrong and implements the recovery mechanism while the software is being written. I often hear teams say they will get it working and go back later to clean it up and handle errors. Unfortunately, that rarely happens, and the result is deploying a disaster that is just waiting to happen. To better handle errors, deal with what can go wrong in the moment, because otherwise it will never be handled.

3 thoughts on “3 Strategies for Embedded Software Error Handling”

  1. Jacob — this article isn’t so much about how to do error handling as it is how to ensure it isn’t handled — and you forgot the most popular — ignore the error codes and assume it will work! (No, please no!)

    I like the offensive techniques suggested by Tyler Hoffman (https://interrupt.memfault.com/blog/defensive-and-offensive-programming) and Dr. Miro Samek (https://www.state-machine.com/dbc).
    Ensure it’s going to fail unless the input is good. This should get a LOT of the errors out of the code in testing and development, and by ensuring detection in production, it also catches even those one-in-a-million corner cases due to timing, age, or wear.

    These are interesting concepts to consider in a regulated medical device, for example, where how you (gracefully) handle the error may hold the patient’s life in the balance.

    As always, the hardest part is ensuring that the detection can be reported and addressed — even if the device MUST continue to function.

  2. Good article on something that is too often left for ‘later’, as you suggest – hence all the bad products out there that don’t work well. This is a real problem in embedded systems due to the lack of a ‘reporting method’ in very small systems. And sadly, it is also a big problem on PCs with unlimited resources.

  3. Hi Jacob,

    when it comes to TODO-s within the source files, I suggest pairing them with VCS and issue tracker.

    Namely, whenever (in git terminology) you push to the repository, have a hook scan changed code for “TODO” and create an issue automatically.

    You can structure your TODO-s text in the source files in a way that it is easy to parse and extract subject and description. Then you put those in the issue tracker automatically. For example, it’s possible to work with Jira from Python to do exactly that.

    Better yet, have the hook _update_ source files with the returned ticket ID.
    This way one gets a non-obtrusive way of documenting what needs to be improved. Switching between the code editor and browser is no fun and tends to lower the focus.

    The whole point behind all this effort is that TODO-s spread within code base tend to be forgotten. Issue tracker can help with this.

    On the topic of fast-paced feature development vs code quality (including error handling), I would love to point at a recent and excellent write-up by Tim Cochran and Carl Nygard:

    https://martinfowler.com/articles/bottlenecks-of-scaleups/01-tech-debt.html

    Very well worth reading! 🙂


    Best regards,
    Andrzej Telszewski
