Software engineers have a bad habit of being overly optimistic. That optimism doesn’t just show up in their estimates of how long a task will take; it also extends to the potential failure modes of their systems. If it works on the bench, the assumption is that it will also work in the field, whether that field is a noisy manufacturing floor or Earth orbit. One of the biggest of these optimistic assumptions is that data is always valid. I once worked on a project where data coming from a sensor was being corrupted and there was no way to verify whether the sensor data was correct! In today’s post, let’s look at several ways that developers can hope for the best and assume the worst when it comes to data integrity.
Tip #1 – At least use a parity check
Parity is a data integrity mechanism that counts the number of ones in a data word and then sets a parity bit so that the total number of ones is either odd or even. For example, let’s say that a sensor transmits a data message that is 16 bits wide and uses odd parity. One of the bits, usually the least significant bit (LSB), is reserved as the parity bit. If the data to be sent is:
1000 1000 1000 100x
Then for odd parity, x is set to 1 so that the message contains five ones. Had the parity been even, x would have been set to 0, since the other 15 bits already contain four ones.
Parity works well for detecting single-bit flips. If a zero becomes a one or a one becomes a zero, the parity no longer matches and the error is detected. More generally, any odd number of flipped bits is caught. However, if an even number of bits flip (two, for example), the parity still matches and the error goes undetected. For that, a more robust technique is required.
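To make the odd-parity example concrete, here is a minimal sketch in C. It assumes the layout described above: a 16-bit word whose LSB is the parity bit. The function names (count_ones16, odd_parity_encode, odd_parity_ok) are made up for illustration and are not from any particular library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count the number of ones in a 16-bit word, one bit at a time. */
static uint8_t count_ones16(uint16_t word)
{
    uint8_t count = 0u;
    while (word != 0u) {
        count = (uint8_t)(count + (word & 1u));
        word = (uint16_t)(word >> 1);
    }
    return count;
}

/* Hypothetical helper: overwrite the LSB so the whole word has odd parity. */
uint16_t odd_parity_encode(uint16_t word)
{
    uint16_t payload = (uint16_t)(word & 0xFFFEu);          /* clear the parity bit  */
    uint16_t parity  = (count_ones16(payload) % 2u == 0u)   /* even count so far?    */
                           ? 1u                             /* add a one to make odd */
                           : 0u;                            /* already odd           */
    return (uint16_t)(payload | parity);
}

/* Hypothetical helper: true if the received word still has odd parity. */
bool odd_parity_ok(uint16_t word)
{
    return (count_ones16(word) % 2u) == 1u;
}
```

With the message above, odd_parity_encode(0x8888) returns 0x8889. Flipping one bit of that word makes odd_parity_ok return false, but flipping two bits leaves the parity odd and the corruption slips through.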
Tip #2 – Use a checksum
A checksum is a small value computed from a block of data by an algorithm designed to detect errors that occur naturally or randomly. The sender calculates the checksum over the data and then transmits or stores it along with that data. Because the dataset often contains the checksum itself, those bytes are ignored when calculating the checksum. The receiver recalculates the checksum and compares it to the one that came with the data to see if they match.
It’s important to realize that not all checksums are created equal; different checksums detect different errors. For example, one checksum may be able to detect that a single bit has changed, while another may be able to detect several bits changing simultaneously. Just because a checksum matches does not guarantee that there are no errors in the data! Checksums are also good at detecting random errors but will not necessarily detect intentional changes, such as those made by someone trying to compromise the system. Developers need to carefully select the checksum for their application. (There are plenty out there, so we won’t go into details here, but one of my personal favorites for use on a microcontroller is the Fletcher16 checksum.)
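For readers who haven’t met Fletcher16, here is a minimal sketch of the commonly published form of the algorithm: two running 8-bit sums, each reduced modulo 255. The function name and signature are my own choices for illustration, and the modulo on every byte favors clarity over speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 over a byte buffer (straightforward, unoptimized form). */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0u;  /* running sum of the data bytes              */
    uint16_t sum2 = 0u;  /* running sum of sum1; makes order matter    */

    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255u);
        sum2 = (uint16_t)((sum2 + sum1) % 255u);
    }

    /* Pack both sums into one 16-bit checksum: sum2 high, sum1 low. */
    return (uint16_t)((sum2 << 8) | sum1);
}
```

For the five ASCII bytes "abcde" this yields 0xC8F0. Because the second sum depends on the order of the bytes, swapping two bytes changes the result, which a plain additive checksum would miss.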
Tip #3 – Use a cyclic redundancy check (CRC)
A CRC is actually a checksum, but a very special type: it is calculated using polynomial division. As you can imagine, on an embedded system, especially a microcontroller-based one, performing polynomial division in software is computationally expensive compared to a simple sum! The benefit is that a CRC can detect a larger range of errors than simpler checksums. CRCs are so effective that many microcontroller vendors include a hardware-based CRC calculator so that developers can use a CRC efficiently. Unfortunately, it is very hit and miss as to whether one is included, so developers need to carefully read their microcontroller’s datasheet.
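When no hardware CRC unit is available, the polynomial division can be done in software one bit at a time with shifts and XORs. The sketch below assumes one common parameter set, often called CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no bit reflection, no final XOR). That choice is mine for illustration; your protocol or hardware may use a different polynomial and settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF
 * (assumed parameters; check what your protocol actually specifies). */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0; i < len; i++) {
        /* Bring the next byte into the top of the CRC register. */
        crc = (uint16_t)(crc ^ ((uint16_t)data[i] << 8));

        /* Divide by the polynomial one bit at a time. */
        for (int bit = 0; bit < 8; bit++) {
            if ((crc & 0x8000u) != 0u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The standard check string "123456789" gives 0x29B1 with these parameters. The inner loop runs eight times per byte, which is the cost the hardware CRC units are meant to remove; a 256-entry lookup table is the usual software compromise.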
In a world of ever-growing data collection and analysis, it’s important to realize that our systems can’t just trust every bit of data that comes across a bus. Electromagnetic interference, cosmic rays and other sources can cause bits to flip and corrupt the data. Without some mechanism in place to detect those bit flips, the corrupted data may be acted upon. In many circumstances this won’t be a big deal, but there are times when it could be catastrophic to the system. The three techniques we discussed today are easy to implement and provide a simple sanity check on all the data going into and out of a system.