Original post date: July 22, 2024
Last updated: July 22, 2024
On July 19, 2024, a faulty update from CrowdStrike, a cybersecurity company, caused a massive IT outage by sending Windows machines into a reboot loop. It was not an embedded-systems problem as such, although some embedded products, such as kiosks, were affected. Even so, embedded developers can learn a few lessons from it.
Test your updates
This advice has been repeated many times, and it still applies: always test your updates, even trivial ones. If you can, keep a test environment that mirrors the real production environment, including systems that run previous versions of your software, not just freshly flashed ones.
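One way to make "not just freshly flashed ones" concrete is to list every upgrade path you intend to test. The following is a minimal sketch in Python; the function name `upgrade_test_matrix`, the version strings, and the hardware revision labels are all hypothetical and only illustrate the idea.

```python
from itertools import product


def upgrade_test_matrix(installed_versions, hardware_revisions, target_version):
    """List every (hardware, from_version, to_version) combination to test."""
    return [
        (hw, old, target_version)
        for hw, old in product(hardware_revisions, installed_versions)
        if old != target_version  # upgrading a version to itself is not a test case
    ]


# Hypothetical fleet: two hardware revisions, three software states in the field.
matrix = upgrade_test_matrix(
    installed_versions=["1.0.0", "1.1.0", "factory"],
    hardware_revisions=["rev-A", "rev-B"],
    target_version="1.2.0",
)
for case in matrix:
    print(case)
```

Each tuple is one device setup to prepare in the test environment before applying the update.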
Have a recovery method in case of a failed update
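One example is an A/B update scheme combined with a watchdog: the new image is written to the inactive slot, and if it fails to boot and confirm itself, the watchdog resets the device and the bootloader falls back to the previous, known-good slot.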
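The slot-selection logic behind an A/B update with watchdog recovery can be sketched as follows. This is a simplified Python model, not the code of any particular bootloader: `BootState`, `stage_update`, `select_slot`, `confirm_boot`, and the limit of three boot attempts are assumptions made for illustration. On a real device the state would live in persistent storage and the logic would run in the bootloader.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOT_ATTEMPTS = 3  # assumed limit before rolling back


@dataclass
class BootState:
    active: str = "A"              # slot known to work
    pending: Optional[str] = None  # freshly written slot awaiting confirmation
    attempts: int = 0              # boots tried on the pending slot


def other(slot):
    return "B" if slot == "A" else "A"


def stage_update(state):
    """Call after the new image has been written to the inactive slot."""
    state.pending = other(state.active)
    state.attempts = 0


def select_slot(state):
    """Bootloader step: choose the slot on power-up or after a watchdog reset."""
    if state.pending is None:
        return state.active
    if state.attempts >= MAX_BOOT_ATTEMPTS:
        # The new image never confirmed itself: give up and roll back.
        state.pending = None
        state.attempts = 0
        return state.active
    state.attempts += 1
    return state.pending


def confirm_boot(state):
    """Call from the application once it has started and passed its self-checks."""
    if state.pending is not None:
        state.active = state.pending
        state.pending = None
        state.attempts = 0


# Simulate a broken image: the watchdog resets the device before
# confirm_boot() is ever reached, so the bootloader runs four times.
state = BootState()
stage_update(state)
boots = [select_slot(state) for _ in range(4)]
print(boots)  # ['B', 'B', 'B', 'A']
```

The key design choice is that the new slot becomes permanent only after the application confirms it. A crash or hang at any earlier point leaves the old slot as the fallback.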
Push remote updates gradually
If you push updates (that is, you initiate them, as opposed to devices periodically checking for updates), start with a small set of devices. If all works well for them, enable another batch, then another, and so on. A gradual update process also decreases the load on your update server. If it is an emergency security update, the intervals might be shorter, or you might start with the critical or most likely affected systems.
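A batch-by-batch rollout with a stop condition could look like the sketch below. The function `staged_rollout`, the batch sizes, and the 10% failure threshold are assumptions chosen for the example; `update_device` stands in for whatever call actually pushes the update and reports success.

```python
def staged_rollout(devices, update_device, batch_sizes=(1, 5, 25), max_failure_rate=0.1):
    """Update devices in growing batches; halt if a batch fails too often.

    Returns (updated, failed, halted).
    """
    updated, failed = [], []
    remaining = list(devices)
    sizes = list(batch_sizes)
    while remaining:
        # After the listed batch sizes are used up, take everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        batch_failures = 0
        for device in batch:
            if update_device(device):
                updated.append(device)
            else:
                failed.append(device)
                batch_failures += 1
        if batch_failures / len(batch) > max_failure_rate:
            return updated, failed, True  # stop before touching more devices
    return updated, failed, False


# Hypothetical fleet in which every device numbered 03 or higher fails.
devices = [f"dev-{i:02d}" for i in range(40)]


def fake_update(device_id):
    return int(device_id[-2:]) < 3


updated, failed, halted = staged_rollout(devices, fake_update)
print(len(updated), len(failed), halted)  # 3 3 True
```

In this simulated run the rollout stops in the second batch, so 34 of the 40 devices never receive the bad update.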
Include telemetry about failed updates
If possible, make your update system report failed updates back to you, so that you can act rapidly and issue a fix. In an embedded system, for example, an update might fail because of hardware differences between devices.
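As a rough illustration, a failure report can be a small JSON payload that the server groups by hardware revision to spot a hardware-related pattern. The field names, the `stage` values, and both function names below are hypothetical and not taken from any specific update framework.

```python
import json
from collections import Counter


def make_failure_report(device_id, hardware_rev, from_version, to_version, stage, error):
    """Device side: build the JSON payload sent after a failed update."""
    return json.dumps({
        "device_id": device_id,
        "hardware_rev": hardware_rev,
        "from_version": from_version,
        "to_version": to_version,
        "stage": stage,  # e.g. "download", "verify", "first_boot"
        "error": error,
    })


def failures_by_hardware(reports):
    """Server side: count failure reports per hardware revision."""
    return Counter(json.loads(report)["hardware_rev"] for report in reports)


# Hypothetical reports received by the update server.
reports = [
    make_failure_report("dev-01", "rev-B", "1.1.0", "1.2.0", "first_boot", "watchdog reset"),
    make_failure_report("dev-07", "rev-B", "1.0.0", "1.2.0", "first_boot", "watchdog reset"),
    make_failure_report("dev-12", "rev-A", "1.1.0", "1.2.0", "download", "timeout"),
]
counts = failures_by_hardware(reports)
print(counts.most_common())  # [('rev-B', 2), ('rev-A', 1)]
```

A cluster of failures on one hardware revision, as in this made-up data, is a hint to pause the rollout for that revision first.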
Develop in user mode
Unless you have to work in the kernel, implement your function at the application level. A fault in kernel-level code can often crash the whole system, whereas a fault in an application usually will not.