SmartThings Downtime and What We're Doing

Ben · October 29, 2014, 2:14am

Simply put, the performance you’ve seen from the platform the past few days is unacceptable. We apologize. Our infrastructure team is working through the night to stabilize the platform while we roll out the changes explained below.

SmartThings is growing rapidly and adding many new customers and connected devices every day. This is an exciting time, but it has also caused short-term challenges. Specifically, over the last few days, SmartThings has experienced service interruptions related to system capacity and load. We know that these interruptions impact you, and we are actively working to solve them.

Our engineering and tech-ops teams are implementing changes both to the production infrastructure of SmartThings and to the Cloud architecture that will provide for much greater scalability and system reliability. We are rolling out the first of these changes over the next two days, with more to come over the next week and beyond. The changes include increased infrastructure capacity and application efficiency, which should have significant positive effects both in the short and long term.

In the meantime, we will continue to vigilantly monitor system performance and update the status page to keep everyone informed of changes to SmartThings availability. You can stay informed of SmartThings platform status by subscribing to updates at http://status.smartthings.com.

Again, we are sorry for the disruption these issues are causing. We truly appreciate your patience as we roll out improvements in the coming days and weeks.

Ben · October 29, 2014, 2:17am

The improvements we are rolling out include many things, but foremost among them is a migration to Cassandra, a database originally developed at Facebook to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Immediate improvements customers should see:

Improved system stability
Greatly improved scalability
Ability to replicate data around the world for geographic redundancy and disaster recovery

digitalnative · October 29, 2014, 2:35am

Thanks Ben. I really appreciate it when a company levels with their customers and lets them know what’s going on. You guys have worked through these issues before. I’m sure you’ll get this ironed out too.

Wolfram · October 29, 2014, 2:44am

Just a question about switching to Cassandra. There seems to be a lot of code that links directly to files (mostly images) at amazonaws.com. Will switching systems affect old code?

greysonmorrow · October 29, 2014, 2:47am

@Ben thanks for the explanation. This is great news I believe moving to Cassandra. I’m sure I can speak for many people and say we do appreciate the explanation. Although the outages are still an issue it gives us as consumers a sigh of relief knowing you realize the issues and are actively reacting to them.

That’s not to say with this type of growth and big announcement from Samsung that you shouldn’t have been proactive in all of this and planned for it a little while ago. But it does feel good knowing you are actively working on it.

I do see though how just a few hours/days of downtime has made me rethink just how much I want to rely on ST. Looking forward to hub 2.0 features. I stated before, I hope you have 2 versions. Premium and entry level.

bmmiller · October 29, 2014, 3:13am

Cassandra and AWS are not mutually exclusive. I can’t possibly know the extent of what they’re doing since I’m in no way affiliated with SmartThings, but using Cassandra doesn’t necessarily mean you aren’t using AWS.

Dlee · October 29, 2014, 3:18am

Hopefully you are targeting performance goals in the last two points above? The App and IDE experience is really slow when you have many devices, even when the system is healthy.

SparkyXI · October 29, 2014, 3:52am

Thanks very much for the explanation, @Ben. As most of us are tech geeks, I’ve got to believe that we understand what’s happening (at least I do). I just got ST last week, and I don’t doubt your efforts one bit. Keep doing what you’re doing, and thanks to you and the ST team for all the hard work. In the meantime, I’ll just pull the batteries out of my alarm.

andrewcbrooks · October 29, 2014, 3:57am

We will work hard to win your confidence back. Rest assured we’ve been planning for this even before Samsung. Look for a customer announcement in email shortly.

jefo13 · October 29, 2014, 3:59am

Thanks for posting this, I kept resetting smartapps trying to figure out why my lights weren’t coming on as planned and the door not unlocking…

Ben · October 29, 2014, 3:59am

We are sticking with many parts of AWS and those files you mention will not be affected.

urman · October 29, 2014, 4:08am

Cassandra is one part; there are lots of smaller changes with far-reaching implications.

Frankly put, this is a big deal to the platform. Ability to withstand increased loads, increased traffic, more hubs, more people. These are major major updates. The last few days have sucked for all customers. It feels terrible to put people through this, and more importantly hurt trust in us and what we’re doing.

The last few days have been an attempt at preparing the entire platform for migration to a new one, the results of which will be wide-reaching. Our platform team, just about all of our founders, marketing, and support will be working through the night as we move the platform migration up to right now. Everyone should expect an email from Alex with more details, and another when it is completed.

Ben · October 29, 2014, 4:10am

Who’s sitting it out?

beckwith · October 29, 2014, 4:18am

Love all nighters…

Personally, just make sure it is stable on Halloween because I have a lot of props depending on it!

tgauchat · October 29, 2014, 4:23am

Please pardon me for being an “armchair quarterback”, but I come from a corporate IT background (mostly financial services).

I’ve been involved in countless major migrations, and a few, indeed, resulted in unplanned slowdowns or outages.

For the most part, however, all such changes went through extensive testing in parallel environments (QA and performance), were then deployed during announced scheduled timeslots, and had failsafe fast rollback / fallback plans (also tested and frequently used; those saved our asses).

What’s changed in the world? I have a lot to learn…

Ben · October 29, 2014, 4:31am

@tgauchat We have been planning this migration for much of this year. We have had a parallel environment running for several months and have been testing extensively on it. Tonight was the PLANNED rollout night (with virtually no downtime), but our team has been fighting the system issues for much of the weekend and the past few days. We simply looked at what we were doing and felt that we should go ahead with the planned migration and accept some downtime (since we were already faced with it) rather than risk another day or two like the past few. We are a veteran team with thoughtful leaders and planned initiatives, but we are also agile and able to change course when the circumstances dictate. I think most of our community will appreciate that.

tslagle13 · October 29, 2014, 5:05am

Many companies cannot do this due to lack of vision. I trust in the SmartThings vision and know that this change has been a long time coming. The writing has been on the wall, and this change, by me at least, is welcomed with open arms! Thanks again for being so transparent!! Y’all run your company like many refuse to, and IMO this wins over anything. I’d rather put my faith in a company/platform I feel has my back than one that couldn’t care less about me and only sees me as $$$$…

I challenge anyone on this thread to find a company in this field that has a community like this.

tgauchat · October 29, 2014, 5:16am

Sounds good! Looks like the old system glitches just peaked at the wrong time (or right time, since you have the upgrade staged and ready to go with agility – thanks!).

A scheduled outage of a few specific hours is consistent with my large customer-facing experiences (e.g. all ATMs down, or even all credit card processing in rare cases). Fully redundant systems with “instantaneous” switchover were always preferable, of course, for mission-critical, customer-facing systems. I think the credit card example was a particularly avoided scenario, as $ millions of transactions are blocked per minute of downtime.

The hope I endlessly repeat (sorry) is for distributed redundancy (either inherent to the SmartThings architecture or via a “best practices guide”), so customers can “always” operate their lights, locks, receive and reset alarms… well, you know…, during scheduled and unscheduled outages.

Thanks for the update. I sincerely wish you a very smooth upgrade night and lots of rest afterwards.

(I most certainly am a part of the community that “appreciates this”.)

twack · October 29, 2014, 7:50am

I have this mental picture of the bridge of the Starship Enterprise with Capt. Kirk (Alex) on his tricorder yelling to Scotty (Urman) to give him warp speed now! Meanwhile Spock (Bob) and Uhura (Ben) search nearby planets for the dilithium crystals (Cassandra). Of course there is cheesy background music too to signify the stressful situation. Dunt Dunt da da da da dunt dunt…

Oh man, I will not be able to unthink of Ben dressed in the Uhura outfit (with boots). What have I done???

P. S. Feel free to cast the rest of the characters.
Bones = James?
Chekov = ?
Worf = ?
