Weekly Update from Alex - 06/09/16

I want to take a moment to thank everyone who provides continued feedback on the platform in response to my posts. We read every single response and turn many of them into actionable items that we use to further improve your experience and the platform. I appreciate everyone who is participating in this process and welcome continued feedback so we can continue to grow together.

This week’s update is the culmination of about a month’s worth of incremental and iterative improvements to the platform. Every platform release includes many behind-the-scenes backend changes that, when put together, make a big impact on the overall SmartThings experience.


SmartApp Execution Times
We’ve been doing a lot of work in the background. By optimizing our codebase and tuning the cloud, we have been able to make some large improvements in SmartApp execution times. Over the past 4 weeks we have seen a 20% overall reduction in execution times, and for outliers we have seen reductions as large as 50%.

State Accuracy
State accuracy is a very important metric that we track. Since I started posting my weekly updates, we have continued to work hard to get state consistency to a level we can trust. I am pleased to say we have reached 99.999% accuracy for states in both SmartApps and Device Handlers. This not only makes development easier on our platform but also makes every facet of your experience on the platform more reliable.

Lastly, I wanted to share a fun anecdote from SmartThings HQ. In the office today at 3:51 PM, a playful rig that one of our engineers set up to demonstrate sensor battery life crossed a milestone. One of our tiny SmartThings SmartSense Multi sensors logged its 1 millionth event on a single coin cell battery. Good job, little sensor!


I am excited to see our continued incremental improvement to the platform and I will continue to update you weekly as we progress and make this the number one smart home platform in the world.

-Alex

36 Likes

I think the fact that the number of responses to these weekly updates seems to be shrinking is possibly an interesting metric in itself.

You’ve come a huge long way, which should be applauded, but at the same time let’s remember that this is a huge long way break-fixing a platform rather than a huge long way advancing a good platform to a great platform. I know that the community devs still have a long list of issues that need to be addressed, some of which have existed for years, signalling that you still have a way to go.

Please don’t let better stability be replaced by complacency and let this happen again.

2 Likes

100% agree with you @Benji. We’re focusing on the basics right now. That is to say, the core tenets of the platform should work and always work. The fun stuff is around the corner. :smiley:

5 Likes

@alex @slagle

I have several GE outlets that are controlled by GE switches. I use Smart Lighting to toggle the outlets when the switch is toggled. An example is my master bedroom, where I have one outlet physically wired to a switch. The switch controls the one wired outlet, but through Smart Lighting, two GE outlets toggle as well. One of your platform updates in the last several months causes Smart Lighting to send a command to the outlets every 15 minutes. I assume this is some sort of safeguard for state accuracy, but it’s also very annoying. Sometimes my wife wants to leave just her light on while she is reading in bed. She turns off the main switch and all lights go out; she then turns on just her light through the app and all is good for about 15 minutes, after which her light turns off because the main switch is off. Bad for WAF!

I reported the issue to support a while back and they said development is working on it but it hasn’t made the cut for the past several weeks. I am hoping this will be addressed soon. I had to stop using Smart Lighting for another switch/outlet because of this issue. I need the ability to control the outlet independently and remain ON if the switch is off.

When Smart Lighting was released, I was able to remove many of the SmartApps I had written, since Smart Lighting addressed most of my use cases and it runs locally. But because of this issue, I am having to complicate my setup again. Thanks for your attention on this issue!

@ritchierich can you PM me the exact details for how you set up your Smart Lighting automation that behaves this way? I want to mimic your specific case and sleuth out the problem from there.

3 Likes

@slagle thanks! I just sent you a PM. Appreciate the help.

The default Z-Wave switch type now repeatedly throws physical actions/events when polled, not status. There are lots of reports of the same issue in the Double Duty smart app thread.

Or it’s failing miserably.

Came home, ST hub offline. App unable to connect. SmartTiles unable to connect.

Three sirens all going off, can’t shut them off.

lovely.

Yup. I got home and was informed that Alexa isn’t doing anything… and she’s not.

I guess I’ll be using the actual light switches tonight.

Outages will always occur; what I find particularly important at this stage is transparency on:

  1. Prompt notification of the outage ( http://status.smartthings.com ).

  2. As much detail as possible of the ongoing outage (for Customers and Developers who wish to click through to find out more).

  3. Accurate ETA for resolution and accurate workaround suggestions, if applicable.

  4. Prompt notification of resolution or any other important updates (such as change in resolution ETA).

  5. Post-mortem detailed explanation of the cause of the outage and the cause of its duration and the plan to prevent recurrence.

  6. Published accurate rolling statistics of outage time (including partial outages).

  7. ? - Your wishes here.
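
For item 6, here is a minimal sketch of how a rolling outage statistic could be computed. It is only an illustration under my own assumptions: the function name `rolling_availability`, the 30-day window, the `severity` weighting for partial outages, and the sample timestamps are all hypothetical and do not describe how SmartThings actually tracks outages.

```python
from datetime import datetime, timedelta


def rolling_availability(outages, window_end, window_days=30):
    """Return availability (0.0 to 1.0) for the window ending at window_end.

    outages: iterable of (start, end, severity) tuples, where severity is the
    assumed fraction of service lost (1.0 = full outage, 0.5 = partial).
    """
    window = timedelta(days=window_days)
    window_start = window_end - window
    lost = timedelta(0)
    for start, end, severity in outages:
        # Clip each outage to the window so only the overlapping part counts.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            lost += (clipped_end - clipped_start) * severity
    return 1.0 - lost / window


# Made-up sample data: one full 6-hour outage and one 2-hour partial outage.
sample_outages = [
    (datetime(2016, 6, 9, 18, 0), datetime(2016, 6, 10, 0, 0), 1.0),
    (datetime(2016, 5, 20, 8, 0), datetime(2016, 5, 20, 10, 0), 0.5),
]
availability = rolling_availability(sample_outages, datetime(2016, 6, 10, 0, 0))
print(f"30-day availability: {availability:.4%}")
```

Weighting partial outages by a severity fraction is one possible design choice; counting any degradation as full downtime would be stricter and simpler to publish.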

7 Likes

For this outage, I got simultaneous notifications on the mobile app that two hubs separated by 100 miles, one on Comcast and one on Verizon, went offline. It was pretty obvious there was a serious problem on SmartThings’ end.

It burns me up that three hours later, there’s no additional communication, no blip of connectivity. Just crickets.

I personally don’t care about the details. They’re interesting to me as a technologist and as someone who deals with outages and system problems professionally, but as a customer I just want to know that they broke something, but it is ok because it’s all hands on deck and they’re backing out the change that broke things.

No doubt.

Waiting for the delivery of that list, I don’t have anything to add to it.

Only to comment that, yes, while outages will occur, the ongoing outage length (this episode) is not acceptable. Possibly even the nature of the outage is unacceptable, but I don’t really know enough to say that; I’m just speculating that there does not appear to be enough resiliency. Meaning, let’s not broadly excuse away folly with narrowly meaningful truths. Gmail’s outages since its inception are acceptable IMO. ST is not anywhere near that type of reliability, so every new outage pushes us further into a perception of poor stability. Only once ST has a proven, accepted track record of stability would that explanation actually have meaning FOR ME, at least.

Tick Tock folks. Tick Tock.

2 Likes

The App is failing on launch…and nothing is working. Assume cloud is totally down.

First you say that you don’t want to know the details and then you specify exactly what details you want to know…? I’m confused. :confused:

  • You can absolutely assume that when there is a major outage posted on the Status Page that they have “all hands on deck” to fix it. It’s only a concern when the Status page doesn’t show a problem that is evident to the many of us posting about it here.

  • If SmartThings continues on the good precedent set during the March and following minor outages, they will update the Status with some fix information including ETAs. They don’t always hit the ETA.

  • Unfortunately, SmartThings hasn’t been good at giving us conclusive reasons for the cause and resolution of incidents. The March outage (started March 13th, 2016 … Daylight Savings Day) is still claimed to be just a sudden unpredicted and unpredictable leg in the “cloud load factor” — OK, technically that’s “conclusive”, I guess, and the resolution has been described in Alex’s weekly updates (ref: the current Topic).

2 Likes

Three hours is a long time to address login issues.

Come on, people. It gets better, then we’re back to ground zero. You are never going to get the reputation of a solid platform doing this. Four months later and we still have hours-long login issues that for the most part kill your app when you are 500 miles away and unable to fix things. By the way, SHM is stuck thanks to this issue: it basically threw users out and broke presence, and then you can’t get in to even turn that off. So I’m in St Louis and my wife is at home, but we’re both shown as away and can’t fix it. Unplugging is all we have to depend on.

2 Likes

I mean, there’s a lot that could go wrong. It’s interesting. But I’d rather I didn’t have to ponder them. My first wish is to be confident that it will be fixed quickly and surely.

We’re going into hour five or six of this outage. Apparently there wasn’t a quick fix, or it wasn’t prioritized. Either way, it doesn’t indicate stability is a top priority, to me.

2 Likes

On the plus side, the timing of this outage is decent. The developer released whatever it was earlier in the day, when she’s going to be around to monitor it, and not on a Friday, when a problem could extend into the weekend and there’s less coverage.

I’m sure diagnosing and fixing it is currently a priority, so much so that there’s no use in them diverting resources to post further explanations while the fix is in progress.

But the reputation benefit / goodwill of all of @Alex’s optimism and progress over the past few weeks evaporates really quickly. :cloud_tornado:

Is the platform fundamentally susceptible to unpredictable outages? Or is the engineering and operations team fundamentally incapable of Quality Assurance and Contingency Management?

SmartThings added Robert Parker to the team, as well as other high-level engineers experienced in high-availability, high-transaction-volume systems. Amazon and NASDAQ lose hundreds of millions of dollars during outages like this, so we know there are ways to reduce the frequency and duration…

2 Likes

Hey all,

We are hard at work to get things back to normal. Normal performance is returning. Please follow the status page for updates.

http://status.smartthings.com

2 Likes

Agreed. But my point is, something changed. Undo it! Diagnose it and fix it later.

2 Likes