In my opinion, one of the worst, most annoying faults to debug on an Arm Cortex-M microcontroller is a hard fault. If you are lucky, the hard fault appears after you’ve made some glaringly obvious mistake, and you can quickly undo it. I recently worked with a colleague who encountered a hard fault, but it was several commits deep, and I had no clue what could cause the fault. In this post, we’ll walk through the process I used to identify the cause and correct the hard fault.
An Imprecise Error
When a hard fault occurs, embedded developers have no choice but to dive into the depths of the microcontroller and examine the fault registers. The first register to examine on a deep dive is the Configurable Fault Status Register (CFSR). The CFSR is composed of three fault registers:
- The MemManage Fault Status
- The BusFault Status
- The UsageFault Status
Together, these registers can help us start down the path to understanding why we have a fault.
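To make the layout concrete, here is a minimal sketch of splitting a raw CFSR value into its three sub-registers. It assumes the Armv7-M layout (MemManage status in bits 7:0, BusFault status in bits 15:8, UsageFault status in bits 31:16). The type `cfsr_fields_t` and the helper `cfsr_split` are names invented for this illustration and are not part of CMSIS.

```c
#include <stdint.h>

/* Hypothetical container for the three CFSR sub-registers. */
typedef struct {
    uint8_t  mmfsr;  /* MemManage Fault Status, CFSR bits [7:0]   */
    uint8_t  bfsr;   /* BusFault Status,        CFSR bits [15:8]  */
    uint16_t ufsr;   /* UsageFault Status,      CFSR bits [31:16] */
} cfsr_fields_t;

/* Split a raw 32-bit CFSR value into its sub-registers. */
static cfsr_fields_t cfsr_split(uint32_t cfsr)
{
    cfsr_fields_t f;
    f.mmfsr = (uint8_t)(cfsr & 0xFFu);
    f.bfsr  = (uint8_t)((cfsr >> 8) & 0xFFu);
    f.ufsr  = (uint16_t)((cfsr >> 16) & 0xFFFFu);
    return f;
}
```

On a target you would pass in the value read from `SCB->CFSR`; on a desktop you can paste in a value copied from the debugger.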
Unfortunately, the values stored in these registers are not always conclusive or helpful, depending on the hard fault. For example, when I examined the value of the CFSR register, I discovered it was set to 0x400. Figure 1 below details what the bits in the CFSR mean. A value of 0x400 means that only bit 10, IMPRECISERR in the BusFault Status, is set. In other words, we have an imprecise error!
Figure 1 – The high-level register definition for CFSR. See here for additional details.
An imprecise error is an asynchronous bus fault: the failing access, typically a buffered write, is reported some time after the instruction that issued it, so the processor has already moved on by the time the fault is raised. It shows up as a hard fault when the BusFault handler is disabled or cannot run at the current priority. The problem with an imprecise fault is that you can’t trust that the other fault registers contain any direct or valuable information about the cause of the fault! That’s right, at this point, you’re in for reverting code or guessing and randomly trying different Band-Aids to try and fix the problem.
From Imprecise to Precise Errors
Thankfully, when you encounter an imprecise error causing your hard fault, all is not lost. The imprecise error may be caused by the CPU’s write buffer, which lets the processor carry on executing instructions while an earlier store is still completing on the bus. If the buffer is disabled, each store must complete before the next instruction executes, at some cost in performance. The result is that the imprecise error turns into a precise error, and the other fault registers may help identify the fault.
The steps to disable the buffer are straightforward. Developers can disable the write buffer by setting DISDEFWBUF in the ACTLR register. The code to do this looks something like the following:
SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;
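Not every Cortex-M core implements the DISDEFWBUF bit, so a guarded version of the line above may be safer in shared code. The sketch below assumes that, on target, the vendor’s CMSIS device header has already been included, so that `SCnSCB` and `SCnSCB_ACTLR_DISDEFWBUF_Msk` are visible when the core provides them. The helper name `disable_default_write_buffer` is hypothetical.

```c
#include <stdbool.h>

/* Assumption: on target, the CMSIS device header is included above this
 * point. If the mask macro is not defined for this core (or the file is
 * built without any CMSIS header), the function changes nothing. */
static bool disable_default_write_buffer(void)
{
#if defined(SCnSCB_ACTLR_DISDEFWBUF_Msk)
    SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;
    return true;   /* write buffer disabled */
#else
    return false;  /* bit not available: nothing was changed */
#endif
}
```

The return value lets the caller log whether the debugging aid actually took effect instead of silently assuming it did.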
In addition to disabling the write buffer, it’s also a good idea to make sure that the Usage, Bus, and Memory faults are enabled in the SHCSR register. These faults can be enabled using the following C code snippet:
// Enable Usage-/Bus-/Mem Faults
SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk
           |  SCB_SHCSR_BUSFAULTENA_Msk
           |  SCB_SHCSR_MEMFAULTENA_Msk;
Compile the code, cross your fingers, and rerun the code. Hopefully, the imprecise error is now a precise error which allows us to dig much further into the cause of the hard fault. In my case, the CFSR register now reads 0x8200! We now have a precise error!
Debugging a Precise Error
Now that we have a precise error, we can examine the other bits in the CFSR register. In this case, the only other bit set is the BFARVALID bit. The BFARVALID bit tells us that the bus address stored in the BFAR register is a valid address and may tell us something about what has caused our fault. Initially, just by the BFARVALID bit being set, we can deduce that we have a bus fault causing our hard fault.
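The bit tests described here can be sketched in a few lines of C. The masks follow the Armv7-M CFSR layout (PRECISERR is bit 9, IMPRECISERR is bit 10, BFARVALID is bit 15). The macro and function names are made up for this example; CMSIS provides its own `SCB_CFSR_*` definitions, which you would normally use on target.

```c
#include <stdbool.h>
#include <stdint.h>

/* BusFault bits as they appear inside the 32-bit CFSR (Armv7-M).
 * Names are local to this sketch. */
#define EX_CFSR_PRECISERR    (1u << 9)
#define EX_CFSR_IMPRECISERR  (1u << 10)
#define EX_CFSR_BFARVALID    (1u << 15)

/* True when the fault was reported asynchronously (BFAR not usable). */
static bool cfsr_is_imprecise_bus_fault(uint32_t cfsr)
{
    return (cfsr & EX_CFSR_IMPRECISERR) != 0u;
}

/* True when the stacked PC points at the faulting instruction. */
static bool cfsr_is_precise_bus_fault(uint32_t cfsr)
{
    return (cfsr & EX_CFSR_PRECISERR) != 0u;
}

/* True when BFAR holds the address that caused the bus fault. */
static bool cfsr_bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & EX_CFSR_BFARVALID) != 0u;
}
```

With the two values from this post, 0x400 reports an imprecise fault with no valid BFAR, while 0x8200 reports a precise fault with a valid BFAR.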
The BFAR register, the bus fault address register, in this case, holds a value of 0x100000. Interesting! Why is the processor faulting when the bus tries to access the address 0x100000? A quick investigation into the microcontroller memory map reveals that the memory address 0x100000 doesn’t exist! Flash memory, in this case, runs from 0x0 up to, but not including, 0x100000, so its last valid byte is at 0xFFFFF. The processor should be throwing faults, but why is the compiler generating accesses outside the memory space?
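The off-by-one nature of this address is easy to miss, so here is a small sketch of the check being done by hand above. It treats a memory region as a half-open range, base inclusive and end exclusive. The `mem_region_t` type and `region_contains` helper are illustrative names, and the flash base and size are the ones quoted in this post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one region in the memory map. */
typedef struct {
    uint32_t base;  /* first valid address      */
    uint32_t size;  /* number of bytes in region */
} mem_region_t;

/* True if addr lies in [base, base + size). Written as a subtraction so
 * that base + size cannot overflow 32 bits. */
static bool region_contains(mem_region_t region, uint32_t addr)
{
    return (addr >= region.base) && ((addr - region.base) < region.size);
}
```

For a flash region with base 0x0 and size 0x100000, the address 0xFFFFF is inside the region and 0x100000 is the first address outside it, which matches the value found in BFAR.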
A Memory Map Bug
Well, it turned out that my colleague was in the process of adding additional sections to the linker script. He was looking to add a section in memory for system configuration but unfortunately forgot that he had to resize the other flash sections. The result was that the new flash section was outside physical flash. The linker script defined the section, so the linker placed it there without complaint, and the code went on to access this nonexistent area of memory. The result was a hard fault caused by a precise bus error!
Troubleshooting hard faults on a microcontroller can be difficult if you don’t use the right process. In this post, we saw that developers could use the CFSR register to identify the cause of their hard fault. In a more complicated situation, developers might need to disable the CPU write buffer to change an imprecise error into a precise one. Once this is done, the time to identify the issue can be dramatically shorter. In total, this particular bus fault only took about 10 – 15 minutes from start to finish. However, it could easily have taken me days. I hope this helps you quickly solve any future hard faults you or your team may encounter.
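One way to catch this class of mistake early is to check the planned layout against the size of physical flash. The sketch below is not the linker script from this project; the section names, origins, and lengths used with it are hypothetical, and `flash_section_t` and `first_section_outside_flash` are names invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one flash section from a linker script. */
typedef struct {
    const char *name;
    uint32_t    origin;  /* start address  */
    uint32_t    length;  /* size in bytes  */
} flash_section_t;

/* Return the index of the first section that is not fully inside
 * [flash_base, flash_base + flash_size), or -1 if all sections fit.
 * 64-bit arithmetic avoids overflow when computing end addresses. */
static int first_section_outside_flash(const flash_section_t *sections,
                                       size_t count,
                                       uint32_t flash_base,
                                       uint32_t flash_size)
{
    const uint64_t flash_end = (uint64_t)flash_base + flash_size;

    for (size_t i = 0; i < count; i++) {
        const uint64_t end = (uint64_t)sections[i].origin + sections[i].length;
        if (sections[i].origin < flash_base || end > flash_end) {
            return (int)i;
        }
    }
    return -1;
}
```

With 1 MiB of flash at 0x0, a layout that leaves the application at its full 0x100000 bytes and adds a configuration section starting at 0x100000 is flagged at the configuration section. Shrinking the application to make room inside the 1 MiB makes the check pass.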
Thank you for the post! It opens eyes to a scenario (a linker script not matching the actual chip memory layout) which is very often hard to find…
On the other hand, I would expect some ARM lib like CMSIS to parse these registers and provide a user-friendly error handler (stdlib-based printout or similar) helping the user to get some clue right away… 🙂
Thanks for the comment. Unfortunately, in my experience, the only handler is a breakpoint in the hard fault handler. It would be great if the default handlers copied register values into a structure that allowed easy debugging, but I haven’t seen anyone do that. Maybe I should create an example for a future post …
I should probably extract the code I wrote, even for myself. One problem is that it is tied to the type of Cortex and to the compiler/linker, as there is no easy generic way to deal with these … CMSIS would be the proper place to do it … at least covering GNU, IAR and Keil … and at least for Cortex-M3, 4 and 7 … to get it right requires a careful reading of at least two ARM manuals (one for the core type and the other for the interrupt controller)
Agreed! There used to be a CMSIS working group that met at Embedded World. I have not been since COVID, but maybe I can pass that along to my Arm contacts and see if they implement it…
This would be great. Even if they settled on just providing them as examples. Personally I see no reason why they would not be a part of CMSIS.
The way I understand it, it is already established that the ISV and related parts are to be modified by the user if so required.
I am pretty sure this problem is being solved again and again by many people in many ways but it would be good to have some standard to start from.
The other annoying thing is that, so far as I know (unless that has changed lately), neither CMSIS nor the development frameworks provide support or examples for the trap/fault handlers. At least 2 – 3 years ago that was the case, when I had to write them from scratch.
The one feature we implemented proved very valuable: in the Release version, the trap/fault handlers would leave a small record in a RAM area marked as not initialized at startup, so on the following reset it could be read and at least displayed on the serial port (as the bugs would happily disappear when run under the debugger!).
Also it is very handy to read and display the reset reason ASAP on the serial console!
Thanks for the comment. That has been my experience as well! Perhaps in a future post, I’ll give an example of how to do this. Thanks again!
Note that not all Cortex-M’s have the ACTLR, and when they have it, they might not have the DISDEFWBUF bit. It is IMPLEMENTATION DEFINED in ARMv7-M and absent in ARMv6-M. Cortex-M3 and M4 have the bit, but Cortex-M7 doesn’t.
One more real case, where analyzing the debug registers didn’t help:
A hard fault occasionally occurred in the DMA interrupt handler of the internal ADC while waiting for the STM32F427 internal flash to become ready. In the end, a consultant discovered the totally unexpected reason: “2.2.12 Data cache might be corrupted during Flash read-while-write operation”. See errata https://www.st.com/resource/en/errata_sheet/es0206-stm32f427437-and-stm32f429439-line-limitations-stmicroelectronics.pdf