Incident Report - SmartApp state - July 25, 2016

On July 25, 2016 we experienced an incident that involved the loss of SmartApp states. We realize that this had an impact on a subset of our valued customers and developers, and sincerely apologize to anyone who was impacted by this incident. We value your continued support during our growth to becoming the best Smart Home platform in the world and we are committed to being as transparent as possible as we grow together.

The following is an incident report for the SmartApp state event on July 25, 2016.

Summary
At 7:16 AM we began a migration to a new Cassandra schema. The changes were necessary to move us to a new, faster architecture. Shortly after starting the migration to the new schema, we discovered an issue that caused the SmartApp state to become corrupted. Upon reversing the migration, the inconsistent state was then written for both old and new schemas, causing some SmartApps to lose state or permanently uninstall.

Timeline on July 25, 2016 (PDT)
7:40 AM - European region migration started
7:52 AM - North America region two migration started
8:07 AM - North America region one migration started
8:14 AM - Reports from the community of apps losing app states
10:02 AM - Migration rolled back
11:15 AM - Root cause identified

Root Cause
There was a bug in the migration logic that handles the inconsistencies between the new and old schemas. When the integrity test compared the two queries, there was a possibility of the check failing. In instances where it failed, the states were merged incorrectly and corrupted the SmartApp state. Since the comparison passed an integrity test, the state was then persisted to the old and new schemas.
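As an illustration only, here is a minimal Python sketch of a migration step that fails closed when the old and new schemas disagree, rather than merging and persisting. This is not the SmartThings migration code: `Store`, `migrate_state`, and `StateMismatch` are hypothetical names, and an in-memory dict stands in for the Cassandra tables.

```python
class StateMismatch(Exception):
    """Raised when the old and new schemas disagree about an app's state."""


class Store:
    """Hypothetical in-memory stand-in for one schema's state table."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})

    def read(self, app_id):
        return self.rows.get(app_id)

    def write(self, app_id, state):
        # Store a copy so later changes by the caller cannot alter the row.
        self.rows[app_id] = dict(state)


def migrate_state(app_id, old, new):
    """Copy one app's state to the new schema without ever merging."""
    old_state = old.read(app_id)
    new_state = new.read(app_id)

    if new_state is None:
        # Nothing in the new schema yet: copy and leave the old row untouched.
        if old_state is not None:
            new.write(app_id, old_state)
        return old_state

    if old_state != new_state:
        # Fail closed: write to neither schema and let a human look at it.
        raise StateMismatch(f"state mismatch for {app_id}")

    return new_state
```

The design choice in this sketch is that a failed comparison never results in a write, so a rollback still finds the old rows exactly as they were.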

Resolution and Recovery
Since this change was a flag-type change in the database, we were able to revert it quickly. Once we rolled back the change, SmartApp states began to behave normally and SmartApp functions returned to normal.

Corrective and Preventative Measures
Our engineers have reviewed the data and are investigating to keep this from happening again in the future. In the short term, the following actions are being taken to help mitigate this kind of event in the future:

  • Sharing our first ever external incident report
  • Writing test cases to identify similar events before they happen
  • Improve internal communications for upcoming changes
  • Improved alerting to enable faster response

SmartThings is committed to improving our operational processes and preventing incidents like this in the future. We thank you for the patience and support during our journey to being the best Smart Home platform in the world.


In lieu of an update from Alex, I will be posting the incident report this week. Alex will resume his updates when he gets back from traveling with family.

19 Likes

Thanks for the info, Tim!

Is it possible to also add a goal of improving external proactive communications for upcoming changes, please?

3 Likes

This[quote="slagle, post:1, topic:54460"]
Improve internal communications for upcoming changes
[/quote]
leads to this[quote="tgauchat, post:2, topic:54460"]
improving external proactive communications
[/quote]

4 Likes

Mr. @slagle, this is our hobby, we hit it hard on weekends, but you? This is your job. Anyone would understand if you take a break on Saturday & Sunday. What are you doing online? Aren’t you tired of all of our whines and moans during the week? By the way, thank you for answering the other outstanding questions. You rock, once again man! #restlesswarrior

7 Likes

Gluten for punishment?

Haha no seriously, I had the answers, and the incident report was done, so thought I’d post.

I think this one took so long because it’s our first one, now we have a guideline for how to do these in the future. Feedback is welcome as well. Does this help give insight and transparency into what happened?

3 Likes

Cutback on the pizza :pizza: and bagels! :stuck_out_tongue_winking_eye:

5 Likes

Darn auto correct, lol!

2 Likes

Woah. I can’t imagine if something like this caused CoRE to uninstall for my installation. What was the scope of that issue?

Doesn’t this make it seem like the recovery was complete? Was it? Seems to me SmartApp states were lost permanently in some cases, no?

I think that should be called out, owning the issue for what it was. Similarly, I think the incident report on Status.Smartthings.com is unclear on this for July and the Mar issue. It’s hard to have credibility without it, IMO. It seems to fail to accept that there were permanent and lingering effects that left end users to both search them out and take action to correct. How many Joe Consumers still have issues in their system from the March incidents?

Unless ST is saying that analysis is incorrect, I think this report and the status page is fatally flawed. It really needs to cover the facts or it’s not acceptable, IMO.

While we do say we are investigating to keep this from happening, I don’t see anything noted that will prevent or correct what occurred.

Additional testing could, but I assume this was tested and something was missed. What I would be looking for is a comprehensive recovery and back out plan that restores previous states and functionality completely - because it is inevitable that the best laid test plans can still fail to catch everything.

Thanks for the update Tim, don’t take this as me being me… just trying to provide some constructive feedback on the process.

1 Like

Tim, thanks for this update. It is much appreciated…

1 Like

I agree. The incident report glosses over the fact that data was lost and there was no attempt to recover it. What is ST doing in regards to data recovery and why wasn’t it an option 3-4 hours in when the problem was recognized?

4 Likes

Because they didn’t offer @ady624 that position yet, to my knowledge:-)

4 Likes

wow thank you for posting this @slagle looking forward to the continued improvement

1 Like

I am very interested in WHY the engineers that are actually there with the hardware have yet to develop a state recovery process. Especially when a community member had a working smartapp state recovery process built, tested, and published within 12 hours.

You guys are the professionals. You have access to ALL of the code. According to the copyrights on CoRE, and I’m sure you have Adrian’s contact info… why has the process he developed NOT been implemented? A user of your service proved within 12 hours what you guys have been saying was not possible since at least March.

You guys are working really hard at pushing through things like, Lux in Smart Lighting. How many users did that affect? How many users lost state information on 25 July?

I’m seeing a possible priority issue here? It’s great that people that drive BMW’s can tell their home to do things… but how great is it when they get home and nothing occurred? Oh, imagine the fury of the soccer moms when they get home with a fresh two dozen eggs, only to find out the refrigerator can’t count eggs.

I think you get my point.

Stability, stability, stability… leads to reliability.

All of that… and the fact that since the 25th things are getting worse daily.

What is being done about the lingering effects of the upgrade failure? There is obviously still a problem as a result of that fiasco.

Please, give some in depth answers.

2 Likes

We are still experiencing some server hot spots and have a team working on this. Keep reporting failures to support@ so we can trend issues and pull logs for investigation.

Yup, my modes have not been changing for the past 24 hours.

I have an idea, why don’t you create a special region for volunteers? With that, you can always test your rollout on it first, with real, advanced users, and give them some token in return :wink:

Many would be happy to volunteer for that special region; even without opting in, we are facing similar risk anyway.

2 Likes

That’s a good idea, but also note, instead of rolling out one after another to each shard, why wouldn’t the roll go to one shard for a period of 24 hours, then the next, etc?
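To make that concrete, here is a rough Python sketch of a one-shard-at-a-time rollout, purely as an assumption about how it could look and not how ST deploys. `staged_rollout` and the `migrate`, `healthy`, and `rollback` callables are all hypothetical; `healthy` is assumed to wait out the soak period (24 hours, say) before answering.

```python
def staged_rollout(shards, migrate, healthy, rollback):
    """Migrate shards one at a time; undo everything on the first unhealthy shard."""
    done = []
    for shard in shards:
        migrate(shard)
        done.append(shard)
        # healthy() is assumed to block for the soak window before returning.
        if not healthy(shard):
            # Roll back in reverse order, newest migration first.
            for migrated in reversed(done):
                rollback(migrated)
            return False, done
    return True, done
```

With three shards, a failure on the second one would mean the third is never touched.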

That being said, I am just guessing here (or second guessing as it may be)…

I am more interested in why the report is deficient. I would hope ST would respond to those deficiencies. There should be a dialog.

I’ll say this. From my experience in code deployment to a large deployment base… You can test and test and load test and user acceptance test and more load test until you are blue in the face. Production and qa/load/uat whatever all are ‘in sync’. Going to prod is always different. Something comes up and bites ya. So I don’t totally blame them; at the same time, recovery from mistakes is key.

Just like we don’t like teams/bosses that are finger pointers. Own up to the failure, and then deal with it. Loss of data is one thing that is never acceptable. At least loss of data that impacts customers isn’t acceptable.

My work has me impacting a couple million users, so I understand the tight-lippedness and not really wanting to give much info until you know. Sharing the wrong thing is worse than not sharing at all, sucks to say that but it’s the truth.

That being said…learning from it is key. What worries me is they have had quite a few learning experiences. And I think a lot of the learning problems happen because of the transition from a scrappy startup to ‘omg samsung money’. Focus was lost once, that was said to be corrected, we will see.

6 Likes

I will give credit… In the year I have been with this platform, I have seen some learning from the mistakes… I just honestly expected more.

2 Likes

This incident report’s goal was to take 100% responsibility for this incident. No smoke, no mirrors, just the facts. We lost the states. There isn’t any more “recovery” we could do, once the states were gone, they were gone. We are owning it, and we are sorry.

2 Likes