Hub Firmware Release 17.11 Post-mortem

Problem

Issues with Hub firmware release 17.11 left some Hubs offline and requiring manual user intervention.

Overview

On Wednesday, March 22nd, during a scheduled firmware update of 8000 Hubs to Hub firmware 17.11, approximately 115 Hubs experienced issues that left them in an offline state. Nearly all of these Hubs were recovered and successfully completed the 17.11 update after a manual power cycle. A small number of Hubs failed to negotiate an IP address using DHCP due to a limitation of certain Cisco hardware. Two additional Hubs were left in a state where power cycling was not sufficient for recovery and had to be replaced. They are being returned to SmartThings engineering for evaluation.

Hubs recovered by reboot

Investigation after the update led to the discovery of flaws in the firmware updater client and server used by SmartThings Hubs. No single flaw was enough to cause the problems on its own; it was the combination of these flaws that led to the issues experienced.

The first flaw existed in the firmware updater client. While the update file is being received, data is passed through a pipeline. Each stage of the pipeline performs a data transform - decompression, decryption, and data validity checking using a cryptographic signature. If corrupt or incomplete data is received, the pipeline rejects it. However, on a small number of Hubs the pipeline raised an incorrect error type - tagging the problem as network related instead of stream related.

When a network error occurs, the client is programmed to retry the last requested file segment and then continue the download - this is a requirement for working with unreliable network connections. In the case of the client flaw, tagging the stream error as a network error caused the client to retry from the same segment it had just downloaded. The retry did not, however, flush the pipeline of bad data. As a result, more data was piled into an already corrupt pipeline, compounding the issue until the entire buffer for the pipeline had been filled and nothing new could be fed through.
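For illustration, here is a minimal C sketch of that failure mode. All names (pipeline_feed, pipeline_flush, on_bad_segment, buffer sizes) are hypothetical - the actual updater client code is not public - but the sketch shows the two ingredients: a corrupt segment tagged as a network error, and a retry path that re-feeds the pipeline without clearing the bad data already buffered.

```c
/* Sketch of the client flaw (hypothetical names): a corrupt segment is
 * tagged as a network error, and the network-error retry path re-feeds
 * the pipeline without flushing the bad data already buffered. */
#include <stddef.h>
#include <string.h>

enum err { ERR_NONE, ERR_NETWORK, ERR_STREAM };

#define PIPE_BUF_LEN 4096
static unsigned char pipe_buf[PIPE_BUF_LEN];  /* memory-only staging buffer */
static size_t        pipe_used;

/* Append a segment; decompression, decryption and signature checking would
 * run over this buffer.  Fails once the buffer is full. */
static enum err pipeline_feed(const unsigned char *seg, size_t len)
{
    if (pipe_used + len > PIPE_BUF_LEN)
        return ERR_STREAM;                    /* buffer full: live-lock point */
    memcpy(pipe_buf + pipe_used, seg, len);
    pipe_used += len;
    return ERR_NONE;
}

static void pipeline_flush(void) { pipe_used = 0; }

static void on_bad_segment(const unsigned char *seg, size_t len)
{
    /* BUG: the corrupt-data condition was classified as a network error... */
    enum err reported = ERR_NETWORK;          /* should have been ERR_STREAM */

    if (reported == ERR_NETWORK) {
        /* ...so the client retried the segment without flushing, piling more
         * data onto an already corrupt buffer until pipeline_feed() could
         * accept nothing new.  The fix flushes and reports ERR_STREAM. */
        pipeline_feed(seg, len);
    } else {
        pipeline_flush();                     /* correct stream-error handling */
    }
}
```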

Normally, the process should have been terminated and the Hub restarted by our watchdog process. However, in this case the watchdog was still being serviced by the main updater process. The download process had triggered what is known as a live-lock: no forward progress was being made, but the process was still alive, so the watchdog did not fire. Since all the buffers were memory-only, a reboot was enough to clear the error condition and restart the update.
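The gap is easiest to see by contrasting a watchdog kicked on liveness with one kicked on progress. The sketch below uses hypothetical names and is not the Hub's actual watchdog code:

```c
/* Sketch of the watchdog gap (hypothetical names): kicking the watchdog
 * because the loop is still running cannot detect a live-lock; kicking it
 * only when measurable progress has been made can. */
#include <stdint.h>

extern void watchdog_kick(void);          /* resets the hardware/OS watchdog */
extern uint64_t bytes_committed(void);    /* advances only on real progress  */

/* Flawed pattern: the process is alive, so the watchdog never fires, even
 * though the same segment is retried forever and nothing is committed. */
void updater_loop_liveness(void)
{
    for (;;) {
        /* ... download / retry work ... */
        watchdog_kick();
    }
}

/* Safer pattern: only kick when the progress counter has advanced; a
 * live-locked download stops kicking and the watchdog reboots the Hub. */
void updater_loop_progress(void)
{
    uint64_t last = bytes_committed();
    for (;;) {
        /* ... download / retry work ... */
        if (bytes_committed() > last) {
            last = bytes_committed();
            watchdog_kick();
        }
    }
}
```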

Digging deeper into the problem still left one question: why did the client receive corrupt or incomplete packets in the first place? This uncovered a separate flaw in the update server software. The update server loads configuration over HTTPS from Amazon S3, an operation that is usually almost instantaneous. However, there are occasional hiccups where a load can time out. Due to the server flaw discovered, it was possible for S3 timeout issues to block service to new connections to the update server. This very quickly filled connection buffers to capacity, and new connections were rejected. Despite random backoffs from clients whose connections had been rejected, there was enough load placed on the server to begin affecting existing connections, and some data frames were not fully transmitted. Due to the issue in the update client outlined above, it was possible for an incomplete frame to be written into the data pipeline and then trigger a retry, cascading to an eventual live-lock as described above.
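The last item in the list below - separating configuration loading from lookup - commonly takes the form of a background refresh thread plus a cached copy that connection handlers read without blocking. A hedged C sketch of that pattern, with hypothetical names and structure (not the actual update server):

```c
/* Sketch of decoupling configuration loading from lookup (hypothetical
 * names, not the actual update server): a background thread refreshes the
 * configuration from S3, while connection handlers only read a cached copy
 * and never block on a slow or timed-out fetch. */
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    char update_channel[64];
    /* ... other settings ... */
} config_t;

static config_t        cached;                       /* last good configuration */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Remote fetch over HTTPS; may be slow or time out.  Stubbed out here. */
static bool fetch_config_from_s3(config_t *out)
{
    strncpy(out->update_channel, "production", sizeof(out->update_channel) - 1);
    return true;
}

/* Background refresher: a failed or slow fetch only delays the next refresh;
 * it can no longer starve the connection-accepting path. */
static void *config_refresher(void *arg)
{
    (void)arg;
    for (;;) {
        config_t fresh = {0};
        if (fetch_config_from_s3(&fresh)) {
            pthread_mutex_lock(&cache_lock);
            cached = fresh;                          /* swap in the new copy */
            pthread_mutex_unlock(&cache_lock);
        }
        sleep(30);
    }
    return NULL;
}

/* Lookup used by connection handlers: a quick copy of the cached config. */
static config_t config_lookup(void)
{
    config_t snapshot;
    pthread_mutex_lock(&cache_lock);
    snapshot = cached;
    pthread_mutex_unlock(&cache_lock);
    return snapshot;
}
```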

All of the flaws in the updater software discovered as part of this investigation have been fixed. Additionally, the following new measures are being taken to help prevent similar issues in the future:

  • Additional server capacity will be allocated during firmware roll-outs to prevent the connection starvation issue that triggered the update client issue.
  • New testing is being done with the server failure scenario to ensure incomplete packets are not being sent to the client.
  • Loading of server configurations is now completely separated from lookup.

Hubs failing to negotiate DHCP

A small number of Hubs (we have two known instances) using network hardware from Cisco were unable to obtain a DHCP address after rebooting during the update to 17.11. The root cause of this issue was traced back to a limitation on select Cisco hardware where DHCP hostnames must be 32 characters or less. Please see this community posting for the original discussion; we greatly appreciate the help from community member @tmclink in identifying and debugging this. This issue has been fixed by reducing the DHCP hostname sent by the Hub to 19 characters.
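The fix itself is simple to illustrate. Below is a hedged sketch - the constant name and example identifier are hypothetical - of clamping the DHCP hostname (option 12) to 19 characters so it stays well under the 32-character limit enforced by the affected Cisco hardware:

```c
/* Sketch of bounding the DHCP hostname (option 12): the affected Cisco
 * equipment rejects hostnames longer than 32 characters, so the Hub now
 * sends at most 19.  Constant and example identifier are hypothetical. */
#include <stdio.h>
#include <string.h>

#define DHCP_HOSTNAME_MAX 19          /* new upper bound sent by the Hub */

/* Build the hostname field, truncating anything over the limit. */
static void build_dhcp_hostname(const char *hub_id, char *out, size_t out_len)
{
    snprintf(out, out_len, "%.*s", DHCP_HOSTNAME_MAX, hub_id);
}

int main(void)
{
    char hostname[DHCP_HOSTNAME_MAX + 1];
    /* An identifier that would previously have produced a >32-char hostname. */
    build_dhcp_hostname("hypothetical-hub-identifier-0123456789", hostname,
                        sizeof(hostname));
    printf("option 12 hostname: %s (%zu chars)\n", hostname, strlen(hostname));
    return 0;
}
```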

Hubs not recovered by reboot

At this time, the root cause of the failure of the few Hubs that did not come back online following a reboot is not definitively known. We have requested the affected Hubs but have not yet received them back for analysis - therefore all debugging work is speculative. Based on server log activity, we are able to pinpoint the general area of code that these Hubs were executing at the time of their failure. This code is responsible for rearranging data to support the new full-image update scheme - and as such must be extremely error tolerant. We identified two edge cases where a very specific chain of errors might leave the Hub in an unrecoverable situation if power is lost. We also created and implemented a new failsafe that was not present before.
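For illustration only - this is a generic pattern, not the Hub's actual failsafe - the usual way to make a data-rearranging step survive power loss is to stage the new layout, sync it, and then switch over atomically, so an interruption leaves either the old data or the complete new data and never a partial mix:

```c
/* Illustrative write-temp / fsync / rename pattern (not the Hub's actual
 * code): if power is lost at any point, either the old data or the complete
 * new data exists on flash, never a partially rearranged mix. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int rearrange_atomically(const char *path, const void *data, size_t len)
{
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.new", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() is atomic on POSIX filesystems: after a crash or power loss,
     * readers see either the old contents or the fully written new contents
     * (once the containing directory is also synced). */
    return rename(tmp, path);
}
```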

Due to the very low occurrence rates of these failures, it is difficult to confirm that they have been definitively fixed for every possible scenario. As a result, if anyone experiences issues like this during the upcoming 17.12 update, we will offer them the option of receiving a replacement Hub right away or - if they have a large network of devices and/or SmartApps - sending their existing Hub in for repair (side note - the pain of not having a v2->v2 migration tool available is not lost on me). We will, of course, also be monitoring the deployment very closely and will apply the brakes again if the red LED - or any other type of failure - increases in occurrence.

Conclusion

Issues with the Hub firmware release 17.11 leaving Hubs with a flashing magenta LED that never clears have been addressed (and Hubs in this state will be recovered by a reboot). Issues with Hubs stuck with a red LED are still under investigation, but multiple actions have been taken and monitoring continues.

Please don’t hesitate to ask any questions!


The road to the future paradise is a very bumpy road!

Very excellent RFO post!

Can’t get this kind of transparency from 99.9% of companies. It’s good to hear the internals, even if generalized a bit!


Thank you for this. This is far more detailed than most companies would release and I think goes to show the understanding SmartThings staff has for the level of clients that visit this forum.


Interesting read. Thanks for sharing!


This is day and night above what most hardware companies do.
Providing a little information goes a long way toward instilling confidence in your user base.
Issues happen - it is really nice to know you address them.
Thank you!


Communication like this is why I LOVE SmartThings, and it makes it easier to deal with the occasional hiccups.


Wow, most companies say “stuff went bad, your loss, we accept no responsibility.” I just want to say thank you for telling us what happened.

This kind of communication is what all companies need. Much appreciated. Keep up the good work.

And for the 2 hubs that wouldn’t come back online… I really feel for you; setting up a new hub can be a large headache if you have 100+ devices like I know some do. SmartThings should give you two something special for the trouble.


Why not hold off on the updates until all the issues have been sorted out?


I’m clearly one of those HUBS NOT RECOVERED BY REBOOT… My Hub has not been online for a couple of weeks now… Even though I’ve been in contact with ST Support, I’ve not been asked to send the Hub in…

Thank you. The transparency in sharing this information is exceptional.

The background information also reinforces my belief that this is the best available home automation system for my needs.

Well done!


Are you telling me I’m one of the two hubs that got toasted? I didn’t get a chance to do a UPS drop-off today. I’ll make sure to get it there tomorrow.


It’s kind of scary to know that a bad firmware update roll-out can end up bricking your hub for good. Having to re-do all my configuration on a new hub wouldn’t be my main concern though (since I don’t have that many sensors), but rather the fact that I live in Europe but not in the UK, and Samsung actively refuses to service hubs and sensors still under warranty unless you have a UK or US forwarding address (which I don’t). So I guess my question is: should my hub get bricked like this, would Samsung actually step in and provide a full replacement service, free of charge, regardless of where I live? Otherwise, give me the option to manually defer firmware updates to my hub until I can be reasonably sure that they won’t cause issues (e.g. a few weeks later than the rest).


This is a totally fair question. The short answer is that we won’t know that all the issues have been sorted out until we put the changes to the ultimate test - updating the fleet of production hubs. We put the latest firmware through many, many hours of testing internally across development hubs and internal beta users. We then put it through many more hours of tests with a customer beta. And yet issues still come out, because a Hub has just the right combination of hardware state + installed software + cosmic radiation + internal dust-bunnies that it triggers some new error state that we hadn’t considered.

The 17.11/17.12 update is unusual in that it is one of the most complicated updates to date. The reason we want to release it as soon as we can, though, is that once it is applied, it will be a significant boon to releasing future updates in a safe way. We don’t want a “brick” any more than you do, and this is a significant step toward getting there. Unfortunately, we need to use the riskier legacy update mechanism one more time on the way to a much better place.


Please keep working with our support team - your offline hub is a v1 hub and was not affected by this issue, which only affected v2 hubs.

Time to go get a lottery ticket? Or is this a sign to avoid chance-based games for a while? :wink:

Not to add to the fear, but this is a risk with every embedded device on the market today. We know how much people rely on their hubs; that’s why we’re pushing technologies like those in this update to further reduce the possibility of damage.


Like many others who have already commented here, wow. Thank you for the inside glimpse on what’s happening behind the scenes. Even though I wasn’t affected by the issue with this update, it was fascinating to learn more about how the technology that powers my smart home works. As a tinkerer, I couldn’t be more thrilled to be on the ST platform.


I wish everyone talked to me like this, so refreshing.


And THIS (insert picture of me beating dead horse here) is why we need to have some sort of migration tool. I’ve always said, it doesn’t have to be perfect, but there should be options to save settings and re-apply them on the new hub.

I do not believe that it is impossible. As a former boss once said, “Anything is possible with the correct amount of resources.” And that statement is absolutely true. I’m still on a v1 Hub, with a v2 still in the box, because it will be a pain to convert over, and I don’t even have a complex setup.
