On July 25, 2016, we experienced an incident that involved the loss of SmartApp states. We realize that this affected a subset of our valued customers and developers, and we sincerely apologize to anyone who was impacted. We value your continued support as we grow into the best Smart Home platform in the world, and we are committed to being as transparent as possible as we grow together.
The following is an incident report for the SmartApp state event on July 25, 2016.
Summary
At 7:16 AM we began a migration to a new Cassandra schema. The changes were necessary to move us to a new, faster architecture. Shortly after starting the migration to the new schema, we discovered an issue that caused the SmartApp state to become corrupted. When we reversed the migration, the inconsistent state was written to both the old and new schemas, causing some SmartApps to lose state or permanently uninstall.
Timeline on July 25, 2016 (PDT)
7:40 AM - European region migration started
7:52 AM - North America region two migration started
8:07 AM - North America region one migration started
8:14 AM - Reports from the community of apps losing app states
10:02 AM - Migration rolled back
11:15 AM - Root cause identified
Root Cause
There was a bug in the migration logic that handles inconsistencies between the new and old schemas. When the integrity test compared the results of the two queries, the check could fail. In those cases, the states were merged incorrectly, which corrupted the SmartApp state. Because the comparison then passed an integrity test, the corrupted state was persisted to both the old and new schemas.
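As an illustration only, the failure mode above can be contrasted with a comparison step that fails closed: when the two reads disagree, nothing is merged and nothing is written back. This is a minimal sketch and not the actual migration code; the names `reconcile`, `StateMismatch`, and the dictionary representation of SmartApp state are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical representation of one SmartApp's state as key/value pairs.
State = Dict[str, str]


class StateMismatch(Exception):
    """Raised when the old and new schemas disagree for a SmartApp."""


def reconcile(app_id: str, old_state: State, new_state: Optional[State]) -> State:
    """Return the state that is safe to persist, or raise on disagreement.

    Assumption: the old schema is the source of truth during the migration.
    A mismatch is reported instead of being merged, so an inconsistent
    result is never written to either schema.
    """
    if new_state is None:
        # Nothing has been migrated yet, so use a copy of the old state.
        return dict(old_state)
    if old_state != new_state:
        raise StateMismatch(f"{app_id}: old and new schemas differ")
    return dict(old_state)
```

In this sketch a caller would catch `StateMismatch`, skip the write for that SmartApp, and flag it for investigation rather than attempting an automatic merge.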
Resolution and Recovery
Since this change was a flag-type change in the database, we were able to revert it quickly. Once we rolled back the change, SmartApp states began to behave normally and SmartApp functions returned to normal.
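To show why a flag-type change is quick to revert, here is a minimal sketch of a read path gated by a single flag. It is an assumption about the general pattern, not a description of the real system; `SchemaRouter`, `use_new_schema`, and the reader callables are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical representation of one SmartApp's state as key/value pairs.
State = Dict[str, str]
Reader = Callable[[str], State]


class SchemaRouter:
    """Routes state reads to the old or new schema based on one flag."""

    def __init__(self, read_old: Reader, read_new: Reader,
                 use_new_schema: bool = False) -> None:
        self._read_old = read_old
        self._read_new = read_new
        self.use_new_schema = use_new_schema

    def read(self, app_id: str) -> State:
        # Flipping the flag changes the code path without a redeploy.
        reader = self._read_new if self.use_new_schema else self._read_old
        return reader(app_id)
```

Setting `use_new_schema` back to `False` restores the old read path immediately. It does not repair data that was already written while the flag was on, which is consistent with the state loss described in the summary.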
Corrective and Preventative Measures
Our engineers have reviewed the data and are investigating how to keep this from happening again. In the short term, we are taking the following actions to help mitigate this kind of event:
- Sharing our first ever external incident report
- Writing test cases to identify similar events before they happen
- Improving internal communications for upcoming changes
- Improving alerting to enable faster response
SmartThings is committed to improving our operational processes and preventing incidents like this in the future. We thank you for your patience and support during our journey to becoming the best Smart Home platform in the world.
In lieu of an update from Alex, I will be posting the incident report this week. Alex will resume his updates when he gets back from traveling with family.