Despite the hopes and dreams of many embedded engineers, reliable code doesn’t happen by accident. It is a painstaking process that requires developers to maintain and manage every bit and byte of the system. There is usually a sigh of relief when an application is validated “successfully” but just because the software is running correctly in that moment under controlled conditions doesn’t mean that it will tomorrow or a year from now.
There is a plethora of techniques for creating a reliable embedded system, ranging from a well-disciplined development cycle to strict implementation and system checking. An entire library could easily be filled with books on reliable software design, but here are seven easily implemented tips that can go a long way toward making a system perform more reliably and catch unexpected behavior.
Tip #1 – Fill ROM with known value
Software developers tend to be a very optimistic group, at least as far as how faithfully their microcontroller will run their code over time. The thought of the microcontroller jumping out of the application space and executing whatever it finds there seems like a rare case; however, the opportunity for this to occur is nothing more than a buffer overflow or the dereferencing of a faulty pointer away. It can and DOES happen! The resulting behavior of the system would be undefined, since the unused memory could hold all 0xFF’s by default or, because that region normally isn’t written, its values could have decayed into only God knows what.
There is a neat linker or IDE trick that can be used to help identify and recover the system from just such an event. The trick is to use the FILL command to fill unused ROM with a known bit pattern. There are many possible choices for the fill pattern, but if the intent is to build a more reliable system the obvious choice is one that lands the processor in an ISR fault handler. If something goes wrong and the processor starts to execute code outside of program space, then the ISR will fire, providing the opportunity to store the state of the processor, registers and system before deciding on a corrective course of action.
Additional information on how to use FILL and alternative strategies for its use can be found in “Improving Code Integrity Using FILL” located here.
Tip #2 – Check Application CRC
One of the great tools available to embedded engineers is that our IDEs and tool chains can automatically generate an application or memory space checksum against which the application can be verified. The interesting thing is that in many cases it is used only at the time of loading program code onto a device! If a CRC or checksum is in memory, then verifying that the application is still intact at start-up (or even periodically for long-running systems) is a great way to catch corruption before it causes unexpected behavior. Now the chance that a programmed application will change is small, but considering the billions of microcontrollers shipped each year and the possibly harsh operating environments, the chance of a corrupted application is not zero. Even more likely, however, is that a bug in the system could cause a flash write or flash erase in a sector, resulting in a corrupted application.
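As a rough illustration of the start-up check described above, here is a minimal sketch in C. It assumes the build tools stored a standard CRC-32 (reflected polynomial 0xEDB88320) for the image; the actual algorithm, the memory range covered and where the reference value lives all depend on the toolchain, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, the variant used by
 * Ethernet and zlib). Slow but table-free, so the ROM footprint is small.
 * Assumption: the toolchain generated its reference value the same way. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}

/* Hypothetical helper: true when the image still matches the CRC that
 * was stored alongside it at build time. */
bool app_image_is_intact(const uint8_t *image, size_t len,
                         uint32_t stored_crc)
{
    return crc32_compute(image, len) == stored_crc;
}
```

At start-up the application would call app_image_is_intact() with the start address and length of its flash image and refuse to run (or fall back to a safe mode) if it returns false. What that fallback looks like is a system-level decision that this sketch leaves open.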
Tip #3 – Perform a RAM Check on Start-up
Verifying on start-up that there are no issues with RAM, either internal or external, is a great way to ensure that the hardware is functioning as expected. There are many different methods that can be used to perform a RAM check, but a common one is to write a known pattern, let it sit for a short period and then read it back. What is read should match what was written. The most common result is obviously that everything is working as expected, but on the off chance that it isn’t, this provides an excellent opportunity for the system to flag that there is a hardware issue!
The truth is that in most cases the RAM check will pass, which is what we want, but in order to build a more reliable and robust system it is important to confirm that the system hardware is functioning. After all, hardware does fail! Thankfully software never fails; it just does what it was coded to do, whether right or wrong. There is a memtest C module that was written back in 2000 by Michael Barr that will save an engineer time when considering a RAM test. The embedded.com link to download the module can be found here.
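The write-then-read-back idea can be sketched in a few lines of C. This is not the memtest module mentioned above, only a minimal illustration with hypothetical names. It omits the short wait between write and read, and it does not try to detect address-line faults.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Writes `pattern` to every word in the region, then reads each word back.
 * volatile keeps the compiler from optimizing the accesses away. */
static bool ram_pattern_pass(volatile uint32_t *base, size_t words,
                             uint32_t pattern)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = pattern;
    }
    for (size_t i = 0; i < words; i++) {
        if (base[i] != pattern) {
            return false;   /* a cell did not hold the value written */
        }
    }
    return true;
}

/* Destructive check: the region's previous contents are lost. Using a
 * pattern and its inverse drives every bit to both 1 and 0. */
bool ram_check(volatile uint32_t *base, size_t words)
{
    return ram_pattern_pass(base, words, 0xAAAAAAAAu) &&
           ram_pattern_pass(base, words, 0x55555555u);
}
```

Because the test overwrites the region, it would normally run very early in start-up, before the C runtime has initialized the variables that live there.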
Tip #4 – Use a Stack Monitor
To many embedded developers, the stack seems to be quite the mystical force. Strange things start to happen and when the engineer is finally stumped, they begin to think well maybe something is going on with the stack. The result is blind tweaking and adjustments of the stack size, position, etc. Often enough the bug has nothing to do with the stack but how can one really be sure? After all, how many engineers perform a worst-case stack size analysis?
The size of the stack is allocated statically at compile time, but it is used in a dynamic way. As code is executed, variables, return addresses and other information that the application needs are stored on the stack. This causes the stack to grow within its allocated memory; however, this growth can exceed the compile-time size limit, causing the stack to corrupt whatever lies in the memory region next door.
One way to be sure that the stack is behaving is to implement a stack monitor as part of the system’s health and wellness code (How many engineers do this?). The stack monitor creates a buffer zone, filled with a known bit pattern, between the stack and the “other” memory region. This pattern is then constantly monitored for any changes. If the bit pattern changes then the stack has grown too far and is on the verge of plunging the system into a dark abyss! The monitor can then log the occurrence, system states and any other useful data that can later be used to diagnose the issue.
Stack monitors are not uncommon in RTOSes or in microcontroller systems that implement a memory protection unit (MPU). The scary part is that these capabilities are usually either off by default or can be turned off by the developer. A quick search of the internet reveals recommendations to turn off the stack monitor in an RTOS to save 56 bytes of flash space! Take a moment and reflect on the implications!
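For systems without an RTOS or MPU to lean on, the guard-zone idea described above might look something like the following sketch. The names, the guard size and the pattern are all hypothetical, and the key assumption is stated in the comments: on a real target the linker script would have to place the guard array directly past the end of the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_GUARD_WORDS   8u
#define STACK_GUARD_PATTERN 0xDEADBEEFu

/* Assumption: the linker script places this array immediately past the
 * end of the stack region (below it on a descending stack). Here it is
 * an ordinary array so the logic can be exercised anywhere. */
static volatile uint32_t stack_guard[STACK_GUARD_WORDS];

/* Fill the guard zone with the known pattern; call once at start-up. */
void stack_monitor_init(void)
{
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++) {
        stack_guard[i] = STACK_GUARD_PATTERN;
    }
}

/* Returns false if any guard word has been overwritten, which suggests
 * the stack has grown into the guard zone. */
bool stack_monitor_ok(void)
{
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++) {
        if (stack_guard[i] != STACK_GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```

stack_monitor_ok() would be called periodically from the health-check code. When it returns false, the application can log what it knows and take corrective action before the overflow reaches the neighboring memory region.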
Tip #5 – Use an MPU
It used to be rare to find a memory protection unit on a small and inexpensive microcontroller, but that is beginning to change! MPUs are starting to show up across the whole spectrum of microcontrollers, and this provides embedded software developers the opportunity to drastically improve the robustness of their firmware.
MPUs have generally been associated with operating systems, where they carve out memory spaces in which separate processes or tasks can execute their code without fear of being stomped on. In the event that something does happen, the errant process can be killed, or other protective measures can be performed. Keep an eye open for microcontrollers with this component, and if it is present, take advantage of it!
Tip #6 – Create a Robust Watchdog System
An all-time favorite watchdog implementation to find is one where the watchdog is enabled (which is a great start) but is cleared from a periodic timer that fires independently of anything else going on in the program. That defeats the purpose: the watchdog exists so that if something goes wrong it is not cleared, and when it times out the system is forced through a hardware reset to recover. A timer that keeps clearing the watchdog can keep running even while the rest of the application is hung.
Embedded developers need to carefully think through and design how application tasks will be integrated into the watchdog system. For example, one technique might be to have each task that should run in a given period set a flag once it has successfully done its work. At the watchdog clear point, if any task has not set its flag, that signals a problem and the watchdog is not cleared, forcing a reset. Then there are more advanced techniques, such as using an external watchdog processor that monitors how the primary processor is behaving and vice-versa.
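The flag-per-task technique can be sketched as follows. The task names are hypothetical, and wdt_hw_kick() is a stand-in that only counts calls; on real hardware it would write the microcontroller’s watchdog refresh register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set of tasks that must all run within one watchdog period. */
enum { TASK_SENSOR = 0, TASK_COMMS, TASK_CONTROL, TASK_COUNT };

#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u)

static volatile uint32_t checkin_flags;
static unsigned int wdt_kick_count;  /* stand-in for the hardware watchdog */

/* Placeholder: a real port would refresh the hardware watchdog here. */
static void wdt_hw_kick(void)
{
    wdt_kick_count++;
}

/* Each task calls this after successfully completing its work.
 * Note: with preemption, this read-modify-write needs a critical section. */
void wdt_task_checkin(unsigned int task_id)
{
    if (task_id < (unsigned int)TASK_COUNT) {
        checkin_flags |= (1u << task_id);
    }
}

/* Called at the watchdog clear point. The watchdog is refreshed only if
 * every task has checked in; otherwise it is left to expire and reset. */
bool wdt_service(void)
{
    if (checkin_flags != ALL_TASKS_MASK) {
        return false;
    }
    checkin_flags = 0u;   /* every task must check in again next period */
    wdt_hw_kick();
    return true;
}
```

The point of the design is that a single hung task is enough to withhold the refresh, which is exactly what a free-running timer that clears the watchdog cannot provide.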
Creating a robust watchdog system is absolutely critical to a reliable system. There are too many techniques to cover in a few paragraphs but dedicated articles on this topic will be coming soon.
Tip #7 – Avoid Dynamic Memory Allocation
Engineers that are not used to working in a resource constrained environment may be tempted to use features of their programming language that would allow for the use of dynamic memory allocation. After all, this is a technique that is commonly used on computer systems where memory is only allocated once it is needed. For example, when developing in C one may be tempted to use malloc to allocate space on the heap. An operation is performed and once completed the use of free would return the allocated memory for use on the heap.
On a resource-constrained system this could be a catastrophe! One of the problems with using dynamic memory allocation is that bugs or improper techniques can result in memory leaks or memory fragmentation. Most embedded systems don’t have the resources to monitor the heap, and most of their code isn’t written to handle the problem properly if it occurs. Or even worse, what happens if the application requests space, but the requested space isn’t available?
The issues arising from using dynamic memory allocation can be complicated, and properly handling them can be a nightmare! The alternative is to simply allocate memory upfront statically. For example, instead of requesting a 256-byte buffer through malloc, simply create an array of size 256 in the program. This memory remains allocated throughout the lifetime of the application, and there are no concerns about heap or memory fragmentation.
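A minimal sketch of the 256-byte example, with a hypothetical buffer name and helper function, might look like this:

```c
#include <stddef.h>
#include <stdint.h>

#define RX_BUFFER_SIZE 256u

/* Reserved once at link time: no malloc, no free, no heap to fragment,
 * and the memory cost is visible in the map file. */
static uint8_t rx_buffer[RX_BUFFER_SIZE];

/* Hypothetical helper: copies at most RX_BUFFER_SIZE bytes into the
 * static buffer and returns how many bytes were actually stored. */
size_t rx_buffer_store(const uint8_t *src, size_t len)
{
    if (len > RX_BUFFER_SIZE) {
        len = RX_BUFFER_SIZE;   /* truncate instead of overrunning */
    }
    for (size_t i = 0; i < len; i++) {
        rx_buffer[i] = src[i];
    }
    return len;
}
```

The trade-off is that the buffer must be sized for the worst case up front. The failure mode then moves from a run-time allocation error to a build-time question of whether everything fits.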
These are just a few ways in which a developer can start to create a more reliable embedded system. There are plenty of additional techniques such as using a good coding standard, monitoring for bit flips, performing array and pointer boundary checks and using assertions to name a few more. To learn more of the nitty gritty details of these and other techniques, consider signing up for my monthly Embedded Bytes Newsletter (located here) that will be covering these topics at great length!
What other techniques can be used to build a more reliable embedded system? I look forward to hearing your thoughts!
Great advice here. A few of these I have independently invented and used to make my code more reliable- the others are definitely good as well.
One I now do is “Prime all RAM before starting”. Some compiler/linker tools do this – but I’ve seen it not done. The value here is to avoid inconsistent results when RAM is used before it has been properly primed, in case the C start-up code didn’t prime everything.
Another I do when I don’t have a RTOS and have a virtual background task that is an eternal loop of calling different routines: log the worst case time it takes to get around that loop – that gives me a way to know worst case latency. Even with an RTOS, it could be good to have a dummy low priority thread that simply logs the worst case time it took (after init code all runs) to come around and service the thread.
Thanks for these tips, I use many of them but not all, time to add them to my toolbelt 🙂
Two that I use:
RAM Monitor: I create multiple variables at various addresses and fill them with known values, e.g. 0xAAAA, 0x5555, 0x55AA, etc. Then I check them in the main program loop and if any are corrupted I reset the system as RAM is no longer reliable.
Library Function / Peripheral Timer: I create an interrupt driven variable which gets decremented down. Before calling a library function or accessing a peripheral device I start the variable with a known value above the time it should take to complete the process. Once complete I turn the timer off. If it ever gets to 0 then I know that there is an issue and reset the system.