Tim, thanks for this update. It is much appreciated.
I agree. The incident report glosses over the fact that data was lost and there was no attempt to recover it. What is ST doing in regards to data recovery, and why wasn’t it an option 3-4 hours in, when the problem was recognized?
Because they didn’t offer @ady624 that position yet, to my knowledge:-)
Wow, thank you for posting this, @slagle. Looking forward to the continued improvement.
I am very interested in WHY the engineers that are actually there with the hardware have yet to develop a state recovery process. Especially when a community member had a working SmartApp state recovery process built, tested, and published within 12 hours.
You guys are the professionals. You have access to ALL of the code. According to the copyrights on CoRE, and I’m sure you have Adrian’s contact info… why has the process he developed NOT been implemented? A user of your service proved within 12 hours what you guys have been saying was not possible since at least March.
You guys are working really hard at pushing through things like Lux in Smart Lighting. How many users did that affect? How many users lost state information on 25 July?
I’m seeing a possible priority issue here. It’s great that people that drive BMWs can tell their home to do things… but how great is it when they get home and nothing occurred? Oh, imagine the fury of the soccer moms when they get home with a fresh two dozen eggs, only to find out the refrigerator can’t count eggs.
I think you get my point.
Stability, stability, stability… leads to reliability.
All of that… and the fact that since the 25th things are getting worse daily.
What is being done about the lingering effects of the upgrade failure? There is obviously still a problem as a result of that fiasco.
Please give some in-depth answers.
We are still experiencing some server hot spots and have a team working on this. Keep reporting failures to support@ so we can trend issues and pull logs for investigation.
Yup, my modes have not been changing for the past 24 hours.
I have an idea: why don’t you create a special region for volunteers? With that, you can always test your rollout on it first, with real users, advanced users, and give them some token in return.
Many would be happy to volunteer for that special region; even without choosing to be in it, we are facing a similar risk anyway.
That’s a good idea, but also note: instead of rolling out to each shard one after another, why wouldn’t the rollout go to one shard for a period of 24 hours, then the next, etc.?
That being said, I am just guessing here (or second guessing as it may be)…
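To make the two suggestions above concrete (a volunteer region first, then one shard at a time with a 24-hour soak), here is a minimal sketch. It is only an illustration of the idea, not how SmartThings actually deploys: the shard names, the `deploy` callback, and the `is_healthy` check are all hypothetical.

```python
from datetime import datetime, timedelta


def rollout_schedule(shards, start, soak_hours=24):
    """Return (shard, deploy_time) pairs, one shard per soak window."""
    return [
        (shard, start + timedelta(hours=i * soak_hours))
        for i, shard in enumerate(shards)
    ]


def staged_rollout(shards, deploy, is_healthy):
    """Deploy shard by shard and stop at the first unhealthy shard.

    Returns (completed_shards, failed_shard). failed_shard is None when
    every shard passed its health check.
    """
    completed = []
    for shard in shards:
        deploy(shard)
        if not is_healthy(shard):
            # Halt here: the remaining shards are never touched.
            return completed, shard
        completed.append(shard)
    return completed, None


# Hypothetical shard names; the volunteer region goes first.
SHARDS = ["volunteer", "shard-a", "shard-b"]
schedule = rollout_schedule(SHARDS, datetime(2016, 8, 1))

touched = []
completed, failed = staged_rollout(
    SHARDS,
    deploy=touched.append,                  # stand-in for a real deploy step
    is_healthy=lambda s: s != "shard-a",    # pretend shard-a fails its check
)
```

In this toy run the rollout stops at `shard-a`, so `shard-b` is never deployed to. The point of the soak window is simply that a bad release gets a chance to show itself on a small group before it reaches everyone.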
I am more interested in why the report is deficient. I would hope ST would respond to those deficiencies. There should be a dialog.
I’ll say this. From my experience in code deployment to a large deployment base… You can test and test and load test and user acceptance test and more load test until you are blue in the face. Production and QA/load/UAT, whatever, are all ‘in sync’. Going to prod is always different. Something comes up and bites ya. So I don’t totally blame them; at the same time, recovery from mistakes is key.
Just like we don’t like teams/bosses that are finger pointers. Own up to the failure, and then deal with it. Loss of data is one thing that is never acceptable. At least, loss of data that impacts customers isn’t acceptable.
My work has me impacting a couple million users, so I understand the tight-lippedness and not really wanting to give much info until you know. Sharing the wrong thing is worse than not sharing at all; sucks to say that, but it’s the truth.
That being said…learning from it is key. What worries me is they have had quite a few learning experiences. And I think a lot of the learning problems happen because of the transition from a scrappy startup to ‘omg samsung money’. Focus was lost once, that was said to be corrected, we will see.
I will give credit… In the year I have been with this platform, I have seen some learning from the mistakes… I just honestly expected more.
This incident report’s goal was to take 100% responsibility for this incident. No smoke, no mirrors, just the facts. We lost the states. There isn’t any more “recovery” we could do; once the states were gone, they were gone. We are owning it, and we are sorry.
@slagle It’s all semantics, but the details are important!
This statement says SmartApp functions returned to normal, period.
This is not accurate.
Nowhere does the status or incident report say explicitly, or even imply, that individual user systems were left in a degraded state and that users would need to look for such issues (how do they do that exactly?) and resolve them on their own to restore full functionality.
Look, I am not trying to be a jerk - but this is blatantly obvious and if we’re owning it, we need to say it.
Perhaps there are different expectations for an incident report (post mortem). This is not supposed to be a “customer” alert. It is meant for the more technical readers, to explain what lay behind what they already know. We lost state.
Here’s the progression of the “story”.
We say we lost the states, which we did.
We created a bug.
The bug incorrectly compared values:
“When the integrity test compared the two queries, there was a possibility of the check failing.” (slagle, post 1)
When that happened, the state everywhere became corrupt:
“In instances where it failed, the states were merged incorrectly and corrupted the SmartApp state.” (slagle, post 1)
Bad states were written everywhere.
Again, this wasn’t meant to explain what happened, but how it happened. Two totally different goals.
How can one effectively explain how something happened to an audience that doesn’t know what happened?
They are not only absolutely related, they are inextricably linked.
Then there is this from a previous post:
Those statements don’t mesh.
Either way, neither the status nor the incident report clearly identifies what remained broken and what users needed to fix on their own. No matter what else is said about it, I don’t think that’s acceptable. Especially if the stated goal, as you said, was to take full responsibility for the incident.
So basically, we cannot recover states from a previous night’s backup?
@slagle Thanks for all the efforts that go into this. There is no report that will make every person happy, but this was a good effort, direct with admission of lost state.
We can all play armchair architects, but after years of working on a large system with national usage, and having had hiccups, I see you guys going in the right direction. Discrete, partial data recovery from backups sounds neat and futuristic when not knowing the architecture, but unfortunately the real world of a realtime system whose state continues to change makes that a bigger challenge than most appreciate. As @KevinH said, customer data loss is always considered unacceptable, so it’s rare to see this much shared. I applaud the report efforts and candor.
Incidents happen, and I see progress year over year. Not as magically fast as folks want, and we all see the road bumps here as power users. However, I can’t say I am displeased with a service that is provided for a $99 fee and supports a complex ecosystem from a multitude of vendors, while simultaneously allowing any developer to have a go at blasting the infrastructure. I bought lots of items to be at 160+ devices, and not a ton ST branded, so I am getting large value for what I’ve directly paid SmartThings. While hiccups annoy me, I find myself mostly happy (once I got off my butt, moved to V2, and got as much as possible processed locally).
Keep up the good fight, Tim.
I’m still wondering why a state restore rebuild is not being jumped on as a high priority right now.
The ground work has been done… What’s stopping you?
Imagine, you could boast this… “as we grow as a platform and a worldwide system there may be bumps in the road, but we have a system in place that can help to prevent lost data”
And what about the other several dozen (or more) “high-priority” items?
I’m not joking, just curious about how you think SmartThings prioritizes efforts…
Agree that everything needs to be prioritized and there is always competition in that regard.
I think what would be helpful is seeing those priorities executed and marked off. ST has no obligation and will likely not be as transparent as one may like, however, something in that regard is called for because of where confidence stands.
When an ecosystem is healthy, moving forward, and there is confidence - say Apple’s iPhone - there isn’t a need to share. But when you’ve lost the confidence of your clientele, one way to rebuild is to become more transparent. I think that’s part of what Alex’s updates are supposed to be, for example…