How to Debug a Hard Fault on an Arm Cortex-M

In my opinion, one of the worst, most annoying faults to debug on an Arm Cortex-M microcontroller is a hard fault. If you are lucky, the hard fault appears after you’ve made some glaringly obvious mistake, and you can quickly undo it. I recently worked with a colleague who encountered a hard fault, but it was several commits deep, and I had no clue what could cause the fault. In this post, we’ll walk through the process I used to identify the cause and correct the hard fault.

An Imprecise Error

When a hard fault occurs, embedded developers have no choice but to dive into the depths of the microcontroller and examine the fault registers. The first register to examine on a deep dive is the Configurable Fault Status Register (CFSR). The CFSR is composed of three fault registers:

  • The MemManage Fault Status
  • The BusFault Status
  • The Usagefault Status

Together, these registers can help us start down the path to understanding why we have a fault.

Unfortunately, the values stored in these registers are not always conclusive or helpful, depending on the hard fault. For example, when I examined the value of the CFSR register, I discovered it was set to 0x400. Figure 1 below details what the bits in the CFSR mean. A value of 0x400 is an imprecise error!

Figure 1 – The high-level register definition for CFSR. See here for additional details.

An imprecise error is an asynchronous fault, a bus fault that is forced due to a priority issue, disabling the fault, a memory access issue, or so forth. The problem with an imprecise fault is that you can’t trust that the other fault registers contain any direct or valuable information about the cause of the fault! That’s right, at this point, you’re in for reverting code or guessing and randomly trying different Band-Aids to try and fix the problem.

From Imprecise to Precise Errors

Thankfully, when you encounter an imprecise error causing your hard fault, all is not lost. The imprecise error may be caused by the CPU using an internal buffer to cache instructions. If the buffer is disabled, every instruction executed will be executed linearly. The result will be that the imprecise error turns into a precise error, and all the other fault registers may help identify the fault.

The steps to disable the buffer is straight forward. Developers can disable the write buffer by setting DISDEFWBUF in the ACTLR register. The code to do this looks something like the following:

SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;

In addition to disabling the write buffer, it’s also a good idea to make sure that the Usage, Bus, and Memory faults are enabled in the SHCSR register. These faults can be enabled using the following C code snippet:

// Enable Usage-/Bus-/Mem Faults

SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk

| SCB_SHCSR_BUSFAULTENA_Msk

| SCB_SHCSR_MEMFAULTENA_Msk;

Compile the code, cross your fingers, and rerun the code. Hopefully, the imprecise error is now a precise error which allows us to dig much further into the cause of the hard fault. In my case, the CFSR register now reads 0x8200! We now have a precise error!

Debugging a Precise Error

Now that we have a precise error, we can examine the other bits in the CFSR register. In this case, the only other bit set is the BFARVALID bit. The BFARVALID bit tells us that the bus address stored in the BFAR register is a valid address and may tell us something about what has caused our fault. Initially, just by the BFARVALID bit being set, we can deduce that we have a bus fault causing our hard fault.

The BFAR register, the bus fault address register, in this case, holds a value of 0x100000. Interesting! Why is the processor faulting when the bus tries to access the address 0x100000? A quick investigation into the microcontroller memory map reveals that the memory address 0x100000 doesn’t exist! Flash memory, in this case, is from 0x0 to 0x100000. The processor should be throwing faults, but why is the compiler generating instructions outside the memory space?

A Memory Map Bug

Well, it turned out that my colleague was in the process of adding additional sections to the linker script. He was looking to add a section in memory for system configuration but unfortunately forgot that he had to resize the other flash sections. The result was that the new flash section was outside physical flash. The linker had the section specified and therefore didn’t care about putting accesses to this non-exist area of memory. The result was a hard fault caused by a precise bus error!

Conclusions

Troubleshooting hard faults on a microcontroller can be difficult if you don’t use the right process. In this post, we saw that developers could use the CFSR register to identify the cause of their hard fault. In a more complicated situation, developers might need to disable the CPU write buffer to change an imprecise error into a precise one. Once this is done, the time to identify the issue can be dramatically short. In total, this particular bus fault only took about 10 – 15 minutes from start to finish. However, it quickly could have taken me days. I hope this helps you quickly solve any future hard faults you or your team may encounter.

 

Share >