Problem
Issues during the Hub firmware 17.11 release left some Hubs offline and in need of manual user intervention.
Overview
On Wednesday, March 22nd, during a scheduled firmware update of 8,000 Hubs to Hub firmware 17.11, approximately 115 Hubs experienced issues that left them offline. Nearly all of these Hubs were recovered and successfully completed the 17.11 update after a manual power cycle. A small number of Hubs failed to negotiate an IP address via DHCP due to a limitation of certain Cisco hardware. Two additional Hubs were left in a state where power cycling was not sufficient for recovery and had to be replaced; they are being returned to SmartThings engineering for evaluation.
Hubs recovered by reboot
Investigation after the update led to the discovery of flaws in the firmware updater client and server used by SmartThings Hubs. No single flaw was enough to cause the problems on its own; it was the combination of these flaws that led to the issues experienced.
The first flaw existed in the firmware updater client. As the client receives the update file, the data is passed through a pipeline. Each stage of the pipeline performs a data transform - decompression, decryption, and data-validity checking using a cryptographic signature. If corrupt or incomplete data is received, the pipeline rejects it. However, on a small number of Hubs the pipeline raised the wrong error type, tagging the problem as network related instead of stream related.
When a network error occurs, the client is programmed to retry the last requested file segment and then continue the download - this is a requirement for working over unreliable network connections. Because the stream error was tagged as a network error, the flawed client retried the segment it had just downloaded, but it was never forced to flush the bad data out of the pipeline. As a result, more data was piled into an already corrupt pipeline, compounding the issue until the pipeline's entire buffer had filled and nothing new could be fed through.
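To make the interaction concrete, here is a minimal, purely illustrative sketch - the class and function names are hypothetical, not the actual Hub client code - of how tagging a stream error as a network error and retrying without a flush lets corrupt data accumulate until the pipeline buffer is full:

```python
# Hypothetical sketch - not the SmartThings updater client.

class NetworkError(Exception):
    """Transient transport failure: retry the last segment and continue."""

class StreamError(Exception):
    """Corrupt or incomplete data: the pipeline must be flushed first."""

class Pipeline:
    CAPACITY = 64 * 1024                     # assumed buffer size

    def __init__(self):
        self.buffer = bytearray()

    def feed(self, segment: bytes) -> bool:
        if len(self.buffer) + len(segment) > self.CAPACITY:
            return False                     # pipeline full: nothing new can be fed through
        self.buffer.extend(segment)          # data enters the pipeline
        if not self._valid(segment):         # decompress / decrypt / check signature
            raise StreamError("corrupt or incomplete segment")
        return True

    def flush(self) -> None:
        self.buffer.clear()                  # the step the flawed client never took

    @staticmethod
    def _valid(segment: bytes) -> bool:
        return not segment.endswith(b"\xff") # placeholder for the real validity checks

def flawed_retry(pipeline: Pipeline, fetch, offset: int) -> None:
    while True:
        try:
            if pipeline.feed(fetch(offset)): # retry the same segment after a failure
                return
            # Buffer is full: the loop keeps spinning without making progress.
        except StreamError:
            # Flaw: handled like a NetworkError, so the segment is simply re-fetched,
            # but pipeline.flush() is never called and the bad data piles up.
            continue
```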
Normally, the process should have been terminated and the Hub restarted by our watchdog process. In this case, however, the main updater process was still servicing the watchdog. The download had entered what is known as a live-lock situation: no forward progress was being made, but the process was still alive, so the watchdog never fired. Since all the buffers involved were memory-only, a reboot was enough to clear the error condition and restart the update.
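The gap can be illustrated with another hypothetical sketch: a watchdog that is kicked merely because the process is running cannot see a live lock, whereas one kicked only on measurable forward progress would have fired and rebooted the Hub.

```python
# Hypothetical sketch - not the actual Hub watchdog.
import time

class Watchdog:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()

    def kick(self) -> None:
        self.last_kick = time.monotonic()

    def expired(self) -> bool:
        return time.monotonic() - self.last_kick > self.timeout_s

def flawed_loop(watchdog: Watchdog, step) -> None:
    # Kicked on every iteration just because the process is alive,
    # so a live-locked loop never trips the watchdog.
    while True:
        watchdog.kick()
        step()

def progress_aware_loop(watchdog: Watchdog, step, progress) -> None:
    # Kicked only when forward progress (e.g. completed segments) is observed,
    # so a loop that spins without progress eventually triggers a restart.
    last = progress()
    while not watchdog.expired():
        step()
        if progress() > last:
            watchdog.kick()
            last = progress()
```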
Digging deeper into the problem still left one question: why did the client receive corrupt or incomplete packets in the first place? Answering it uncovered a separate flaw in the update server software. The update server loads its configuration over HTTPS from Amazon S3, an operation that is usually almost instantaneous, but occasional hiccups can cause a load to time out. Due to the server flaw we discovered, an S3 timeout could block service to new connections to the update server. This very quickly filled the connection buffers to capacity, and new connections were rejected. Despite random backoffs from clients whose connections had been rejected, enough load was placed on the server to begin affecting existing connections, and some data frames were not fully transmitted. Because of the client issue outlined above, an incomplete frame could be written into the data pipeline and then trigger a retry, cascading into the eventual live-lock described above.
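The failure mode on the server resembles the following sketch (again hypothetical - exactly where the configuration load sat in the real request path is an assumption): a configuration load performed on the connection-handling path stalls every new connection whenever S3 is slow, so the backlog fills and further connections are refused.

```python
# Hypothetical sketch - not the actual update server.
import socket

def load_config_from_s3() -> dict:
    # Stand-in for the HTTPS read from S3; occasionally this times out.
    return {"firmware": "17.11"}

def handle(conn: socket.socket, config: dict) -> None:
    conn.sendall(b"segment data")            # placeholder response
    conn.close()

def flawed_serve(listener: socket.socket) -> None:
    while True:
        conn, _ = listener.accept()
        config = load_config_from_s3()       # one slow load blocks the accept loop;
        handle(conn, config)                 # the backlog fills and new connections
                                             # are rejected
```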
All of the flaws in the updater software discovered as part of this investigation have been fixed. Additionally, the following new measures are being taken to help prevent similar issues in the future:
- Additional server capacity will be allocated during firmware roll-outs to prevent the connection starvation issue that triggered the update client issue.
- New testing is being done with the server failure scenario to ensure incomplete packets are not being sent to the client.
- Loading of server configurations is now completely separated from lookup; a rough sketch of this separation follows the list.
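As an illustration of that last point (a sketch under assumed names, not the production code): configuration is loaded and refreshed off the request path, so a lookup only ever reads a cached copy and an S3 timeout can no longer block connections.

```python
# Hypothetical sketch - not the production server code.
import threading
import time

class ConfigCache:
    """Loads configuration in the background; lookups never block on S3."""

    def __init__(self, loader, refresh_s: float = 60.0):
        self._loader = loader
        self._config = loader()              # one blocking load at startup
        threading.Thread(target=self._refresh, args=(refresh_s,),
                         daemon=True).start()

    def _refresh(self, interval_s: float) -> None:
        while True:
            time.sleep(interval_s)
            try:
                # A slow or timed-out S3 load only delays this refresh;
                # lookups keep serving the cached copy in the meantime.
                self._config = self._loader()
            except Exception:
                pass

    def get(self) -> dict:
        return self._config                  # lookup: no network I/O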
Hubs failing to negotiate DHCP
A small number of Hubs (we have two known instances) using network hardware from Cisco were unable to obtain a DHCP address after rebooting during the update to 17.11. The root cause of this issue was traced back to a limitation on select Cisco hardware where DHCP hostnames must be 32 characters or less. Please see this community posting for the original discussion; we greatly appreciate the help from community member @tmclink in identifying and debugging this. This issue has been fixed by reducing the DHCP hostname sent by the Hub to 19 characters.
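For illustration only (the Hub's real hostname format is not shown here, so the composition below is assumed), the fix amounts to capping the hostname sent in the DHCP request:

```python
# Hypothetical sketch - the actual hostname format is assumed.
MAX_DHCP_HOSTNAME_LEN = 19   # length the Hub now uses, safely under the
                             # 32-character limit seen on some Cisco hardware

def dhcp_hostname(prefix: str, serial: str) -> str:
    return f"{prefix}-{serial}"[:MAX_DHCP_HOSTNAME_LEN]

# dhcp_hostname("SmartThings", "0123456789ABCDEF") -> "SmartThings-0123456"
```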
Hubs not recovered by reboot
At this time, the root cause for the few Hubs that did not come back online following a reboot is not definitively known. We have requested the affected Hubs but have not yet received them back for analysis, so all debugging work so far is speculative. Based on server log activity, we are able to pinpoint the general area of code these Hubs were executing at the time of their failure. This code is responsible for rearranging data to support the new full-image update scheme, and as such it must be extremely error tolerant. We identified two edge cases where a very specific chain of errors might leave the Hub in an unrecoverable state if power is lost, and we created and implemented a new failsafe that was not present before.
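Because the affected Hubs have not been analyzed yet, the following is only a generic sketch of the kind of failsafe described, not the Hub's actual code: write the rearranged data to a scratch location, verify it, and only then commit atomically, so that losing power at any point leaves either the old or the new copy intact.

```python
# Generic sketch only - not the actual Hub update code.
import os

def commit_rearranged_image(data: bytes, target: str) -> None:
    tmp = target + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # make sure the bytes reach storage
    with open(tmp, "rb") as f:
        if f.read() != data:          # verify before committing
            os.remove(tmp)
            raise IOError("verification failed; original image left untouched")
    os.replace(tmp, target)           # atomic swap: the failsafe commit point
```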
Due to the very low occurrence rate of these failures, it is difficult to confirm that they have been definitively fixed for every possible scenario. As a result, anyone who experiences an issue like this during the upcoming 17.12 update will be offered the choice of receiving a replacement Hub right away or - for those with a large network of devices and/or SmartApps - sending the existing Hub in for repair (side note - the pain of not having a v2->v2 migration tool available is not lost on me). We will, of course, also be monitoring the deployment very closely and will apply the brakes again if red LED failures - or any other type of failure - increase in occurrence.
Conclusion
The issues with Hub firmware release 17.11 that left Hubs with a flashing magenta LED that never clears have been addressed (and Hubs in this state can be recovered with a reboot). The issue with Hubs stuck on a red LED is still under investigation, but multiple preventive actions have been taken and monitoring continues.
Please don’t hesitate to ask any questions!