Software engineers have a bad habit of being overly optimistic. That optimism doesn’t just color their estimates of how long a task will take; it also colors their view of the potential failure modes of their system. If it works on the bench, the assumption is that it will also work in the field, whether that field is a noisy manufacturing floor or Earth orbit. One of the biggest optimistic assumptions is that data is always valid. I once worked on a project where data coming from a sensor was being corrupted and there was no way to verify whether the sensor data was correct! In today’s post, let’s look at several ways that developers can hope for the best while planning for the worst when it comes to data integrity.
Tip #1 – At least use a parity check
Parity is a data integrity mechanism that looks at the number of ones in a data stream and adjusts a parity bit so that the total number of ones is either odd or even. For example, let’s say that a sensor transmits a data message that is 16 bits wide and uses odd parity. One of the bits, usually the least significant bit (LSB), is reserved for parity. If the data to be sent out is:
1000 1000 1000 100x
Then, in order to achieve odd parity, x will be set to 1 so that there are five ones in the data. If the parity had been even, then x would have been set to 0, since there are already four ones in the data.
Parity works well for detecting single bit flips. If a zero becomes a one or a one becomes a zero, the mismatch is caught. However, if two bits flip (or any even number of bits), the error goes undetected. For that, a more robust technique is required.
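To make that concrete, here is a minimal sketch (my own illustration, not from any particular library) of how a receiver might check the parity of a 16-bit word in software; the function names are assumptions for the example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Count the ones in a 16-bit word and report whether the count is odd. */
static bool has_odd_parity(uint16_t word)
{
    uint8_t ones = 0;

    for (uint8_t bit = 0; bit < 16; bit++)
    {
        if (word & (1u << bit))
        {
            ones++;
        }
    }

    return (ones & 1u) != 0u;
}

/* The receiver of an odd-parity message simply checks that the total
 * number of ones (data bits plus the parity bit) is odd. */
bool message_parity_ok(uint16_t message)
{
    return has_odd_parity(message);
}
```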
Tip #2 – Use a checksum
A checksum is an algorithm designed to detect errors that occur naturally or randomly in a dataset. The algorithm is run over a block of data and produces a checksum value for that block. The dataset often carries its checksum within the message itself, so those bytes are skipped when the checksum is recalculated. The recalculated checksum is then compared to the checksum that came with the data to see if they match.
It’s important to realize that not all checksums are created equal; different algorithms detect different classes of errors. For example, one checksum may only be able to detect that a single bit has changed, while another may detect several bits changing simultaneously. Just because a checksum matches does not guarantee that there are no errors in the data! Checksums are also good at detecting random errors but will not necessarily detect intentional changes, such as those made by someone trying to compromise the system. Developers need to select the checksum for their application carefully. (There are plenty out there, so we won’t go into detail here, but one of my personal favorites for use on a microcontroller is the Fletcher16 checksum.)
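As a rough illustration of the Fletcher16 checksum mentioned above, a minimal implementation based on the commonly published form of the algorithm might look like this (the function name is mine, and a real message format would define exactly which bytes are covered):

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16: two running sums, each reduced modulo 255, combined into a
 * 16-bit result. The checksum bytes themselves are excluded from 'data'. */
uint16_t fletcher16(const uint8_t *data, size_t length)
{
    uint16_t sum1 = 0;
    uint16_t sum2 = 0;

    for (size_t i = 0; i < length; i++)
    {
        sum1 = (sum1 + data[i]) % 255u;
        sum2 = (sum2 + sum1) % 255u;
    }

    return (uint16_t)((sum2 << 8) | sum1);
}
```

The receiver recomputes the checksum over the payload, skipping the transmitted checksum bytes, and compares the result against the value carried in the message.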
Tip #3 – Use a cyclic redundancy check (CRC)
A CRC is actually a checksum, but a very special type: one that uses polynomial division to calculate its result. As you can imagine, on an embedded system, especially a microcontroller-based one, performing polynomial division bit by bit in software can be computationally expensive! The added benefit is that a CRC can detect a wider range of errors than simpler checksums. CRCs are so effective that many microcontroller vendors include a hardware CRC calculator to allow a developer to compute a CRC efficiently. Unfortunately, whether the peripheral is included is very hit and miss, so developers need to read their microcontroller’s datasheet carefully.
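To show what that polynomial division looks like in software, here is a bit-by-bit CRC-16 sketch. The polynomial (0x1021) and initial value (0xFFFF) are assumptions chosen for illustration, matching the widely used CRC-16/CCITT-FALSE parameters; a real design should pick parameters suited to its protocol, or use the hardware peripheral when one is available:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC-16 using the CCITT polynomial x^16 + x^12 + x^5 + 1 (0x1021).
 * Each input bit is shifted in and, whenever the top bit falls out, the
 * polynomial is XORed in -- this is the "polynomial division" step. */
uint16_t crc16_ccitt(const uint8_t *data, size_t length)
{
    uint16_t crc = 0xFFFFu;   /* assumed initial value (CRC-16/CCITT-FALSE) */

    for (size_t i = 0; i < length; i++)
    {
        crc ^= (uint16_t)data[i] << 8;

        for (uint8_t bit = 0; bit < 8; bit++)
        {
            if (crc & 0x8000u)
            {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            }
            else
            {
                crc = (uint16_t)(crc << 1);
            }
        }
    }

    return crc;
}
```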
Conclusions
In a world of ever-growing data collection and analysis, it’s important to realize that our systems can’t simply trust every bit of data that comes across a bus. Electromagnetic interference, cosmic rays and other sources can flip bits and corrupt data. Without some mechanism in place to detect those bit flips, corrupted data may be acted upon; in many circumstances that won’t be a big deal, but in others it could be catastrophic to the system. The three techniques we discussed today are simple to implement and provide a basic sanity check on all the data going into and out of a system.
Good engineers do not make optimistic assumptions. When I first began this career, I worked with a pair of seasoned engineers who had written a state machine that waited for a specific criterion to move from state to state. There was no timeout functionality, no error condition or bad packet handling, etc. – everything was expected to click along perfectly. When I pointed out that a single serial communication failure would hang the system in any state and that, at the very least, they needed to implement a timeout to retry the transaction, they were taken aback at such an odd suggestion from a junior engineer. Ultimately, they had to rewrite their code after a manager intervened. These two felt they were excellent engineers. The rest of us were glad when they left to pursue other opportunities.
My suggestion is: always use a CRC. You can calculate a CRC without performing polynomial division on the fly; just use the tabulated version of the algorithm. It’s fast, and its only drawback is needing a table of 256 bytes in flash for a CRC-8, or two tables of 256 bytes each for a CRC-16.
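As a sketch of the table-driven approach the commenter describes, the example below generates the 256-entry table at start-up and then uses one lookup per byte; the CRC-8 polynomial 0x07 and zero initial value are illustrative assumptions, and the table could just as easily be precomputed and stored as a const array in flash:

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t crc8_table[256];

/* Build the 256-entry lookup table once, using the CRC-8 polynomial
 * x^8 + x^2 + x + 1 (0x07). */
void crc8_table_init(void)
{
    for (uint16_t value = 0; value < 256u; value++)
    {
        uint8_t crc = (uint8_t)value;

        for (uint8_t bit = 0; bit < 8; bit++)
        {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }

        crc8_table[value] = crc;
    }
}

/* Table-driven CRC-8: one table lookup per byte instead of eight shift
 * and XOR steps, trading 256 bytes of memory for speed. */
uint8_t crc8(const uint8_t *data, size_t length)
{
    uint8_t crc = 0x00u;

    for (size_t i = 0; i < length; i++)
    {
        crc = crc8_table[crc ^ data[i]];
    }

    return crc;
}
```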
For sensor data with no parity or checksum, you can still do some validation. There may be values which are out of range or clearly non-physical, e.g. absolute temperatures below 0 K. If the readings are regularly spaced in time, you can use knowledge of the maximum possible rate of change to reject readings that must be corruptions, and there are some clever filtering algorithms which can do much more.
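A rough sketch of that kind of plausibility checking might look like the following; the temperature limits, slew-rate threshold, and function name are purely illustrative assumptions, not values from any particular sensor:

```c
#include <stdbool.h>

/* Illustrative limits for a hypothetical temperature sensor. */
#define TEMP_MIN_C      (-55.0f)   /* assumed rated minimum */
#define TEMP_MAX_C      (125.0f)   /* assumed rated maximum */
#define TEMP_MAX_SLEW_C (2.0f)     /* assumed max believable change per sample */

/* Reject readings that are out of range or change faster than is
 * physically plausible between two regularly spaced samples. */
bool temperature_reading_plausible(float previous_c, float current_c)
{
    if (current_c < TEMP_MIN_C || current_c > TEMP_MAX_C)
    {
        return false;   /* outside the sensor's physical range */
    }

    float delta = current_c - previous_c;
    if (delta < 0.0f)
    {
        delta = -delta;
    }

    return delta <= TEMP_MAX_SLEW_C;
}
```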
The authoritative source on CRCs is Prof Koopman and it is well worth reading his papers at https://users.ece.cmu.edu/~koopman/projects.html#crc. My own summary and spreadsheet analysis of his data is at http://blog.martincowen.me.uk/tags/crc/