A Message from Alex on Platform Improvements and Our Plan Forward

First, I want to let you know that everyone at SmartThings is fully aware of the issues that have been affecting platform reliability. In the past few weeks, we’ve redoubled our efforts to make some fundamental improvements that will soon be felt by all of you in your everyday experience with SmartThings. Know that these improvements are only the beginning, that we are in this for the long term, and that we are absolutely committed to building the best, most open platform in the world.


It’s just scratching the surface, but I’d like to take you through the technical details of some of the changes we’ve implemented recently:

Smart Home Monitor
We’ve made a few changes to Smart Home Monitor that should help to resolve issues with data consistency, load time, and Arming/Disarming.

The first is really about returning Smart Home Monitor to normal operations. When we saw that the database load was high about a month ago, we made a change to help reduce platform load, which lowered data consistency as a trade-off. In practical terms, this meant that the platform performed better, but that for some users state information was not accurately reflected in our mobile app. We have now reverted this change to allow Smart Home Monitor to once again check against multiple database servers before deciding the current Smart Home Monitor state.
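To make the trade-off above concrete, here is a toy Python model of reading from one database server versus reading from a majority of them. It is only a sketch under stated assumptions: the names `Versioned`, `read`, and `quorum` are hypothetical, the three-replica layout is invented for illustration, and nothing here is SmartThings' actual code or configuration. It relies on one general fact about Cassandra: when replicas disagree, the value with the newest write timestamp wins.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """One replica's copy of the Smart Home Monitor state (hypothetical model)."""
    value: str
    written_at: int  # logical write timestamp; larger means newer


def quorum(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def read(replicas: list[Versioned], contacted: int) -> str:
    """Ask the first `contacted` replicas and return the newest value among them."""
    responses = replicas[:contacted]
    return max(responses, key=lambda r: r.written_at).value


# Assumed scenario: the "disarmed" write has reached two of three replicas.
replicas = [
    Versioned("armed", written_at=1),     # stale copy, not yet updated
    Versioned("disarmed", written_at=2),
    Versioned("disarmed", written_at=2),
]

single = read(replicas, contacted=1)                       # cheap, may be stale
majority = read(replicas, contacted=quorum(len(replicas)))  # costs more reads

print(single, majority)
```

In this scenario the single-server read returns the stale "armed" state, while the majority read returns "disarmed". The majority read is reliable here because the write reached two replicas and the read contacts two, so with three replicas in total the two sets must share at least one up-to-date copy. That extra work per read is the load the earlier change was avoiding.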

The second change we made is to the way Smart Home Monitor subscribes to events. We made the subscriptions leaner between the parent app, child apps, and devices. In short, you should see more reliable Arming/Disarming of Smart Home Monitor.

Database Capacity Increase
We added more nodes to our database clusters, resulting in a 60% increase in overall cluster size for our main database ring. The augmented database should reduce latency and errors across the SmartThings platform. We also set up and deployed a full additional instance of our North American infrastructure to balance customers between instances and create further room for scale.
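As a rough back-of-the-envelope reading of the 60% figure, the Python sketch below estimates how much of its previous load each node carries after the ring grows. It assumes data and requests are spread evenly across nodes, which real clusters only approximate, and the helper name `per_node_share` is made up for this example.

```python
def per_node_share(growth: float) -> float:
    """Fraction of its previous load each node carries after the cluster grows.

    `growth` is the fractional increase in node count (0.60 for a 60% larger
    ring). Assumes load is evenly balanced before and after the expansion.
    """
    return 1.0 / (1.0 + growth)


share = per_node_share(0.60)
print(f"each node carries about {share:.1%} of its previous load")
```

Under that even-balance assumption, a 60% larger ring leaves each node with roughly five eighths of the work it had before, which is where the expected drop in latency and errors comes from.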

Thrift to CQL Migration
When we first built out our Cassandra infrastructure, we used an API called Thrift. At the time, this was the recommended way to work with Cassandra, but it has since been replaced by CQL as the recommended API. Since Thrift has been deprecated, SmartThings is working to transition our tables to CQL, starting in our highest impact areas. CQL is a better way to use Cassandra that should help to relieve the stress on our database. This change is in progress, but it’s a large undertaking; we hope to have it ready within the next couple of months.

New Scheduling Tool
Over the past few weeks, we moved all SmartThings users onto a new version of our scheduler, which we call Ticker. Ticker is a more robust and isolated scheduler than our prior Cassandra-based system, bringing you more reliability than ever with scheduled Routines and SmartApps. Since moving, we are now processing 100% of schedules with extremely low error counts. Additionally, turning off our old system has contributed to the overall health of our Cassandra data clusters by reducing hotspots on Cassandra database tables. Similar improvements are coming soon for SmartApps that are triggered by events. Check out more details on the new scheduler here.

G1 Garbage Collection
We also implemented the new G1 Garbage Collection algorithm (which is designed for high memory servers) on our Cassandra cluster servers. The G1 algorithm has received strong JVM community support, and since implementation, we have seen positive results on the platform.


This is just a partial summary. Many more improvements are coming in what I expect to be a regular drumbeat over the coming days and weeks and months. I am also committing us to increased transparency, especially as we need your help and feedback. We will provide a weekly update to the community over the coming months so that you can have an understanding of what we are doing.

At a macro level, part of our challenge has come from the very thing that is our biggest strength: our commitment to be the most open smart home platform in the world. That very openness that has led to so many of your innovative apps and solutions has also created challenges. It has shown us where our platform architecture needs to mature further to accommodate your innovations while also ensuring world-class reliability and performance for all customers. It has also shown us where we need to provide better developer tools in the near future. Those tools will help us to have certainty and take accountability for our platform, while also enabling you to know where errors might be arising within your specific apps and contributions.

Thank you for your patience and continued contributions as we scale. We’ve made the improvements to your service and the basics of our platform our top priority. By doing this and working with you, we’ll make the world smarter, together.

-Alex

62 Likes

This sounds like there’s some system “vote” on what the state of my SHM monitor is. Why does it need to vote to determine if I pressed arm or disarm? Is this related to why all of my devices generate random events? ST has to consult multiple sources to figure out if they are “real”?

I guess I was hoping for a “we fixed it” not a “we did some stuff and we’re hopeful that the number of times it doesn’t work will be reduced”

Sorry this sounds so negative, but I’ve been with you guys for 2.5 years now, it’s been the same frustrating story that whole time.

7 Likes

On my system it has gotten worse than it has ever been. The SHM does not disarm with any regularity. Even after I finally get it disarmed it may trigger an alarm. This has been going on for over a week. I have disconnected my siren from the system and disabled push notifications on SHM. Even that hasn’t stopped the system from pushing notifications to my phone as well as my SO’s phone. I am not a happy camper right now.

3 Likes

Those issues are representative of some of the worst issues a subset of people have experienced. Please let us know whether things are better now as a result of the releases today. I expect that they will be. Many additional steps are being taken beyond SHM. We’ll stay at it with you until it is the wonderful experience you expect.

3 Likes

Also, you must have about 65,536 employees now considering the number of times you’ve “doubled your efforts” to make things reliable :)

6 Likes

The arming / disarming bug was separate from the state consistency issue. We have taken steps to alleviate that as well (see the comments above on streamlining subscriptions), with some additional improvements coming next week. It should be better.

I’m not promising that “we fixed it” until we have rooted through a range of issues. I do know that it is getting better and that we will be making improvements with very high frequency. We will provide updates with detail here in the community every week until there is a much greater sense from our long term supporters that we are in a better place than we have ever been to-date.

Hah! Not quite, but I will say that part of it has been eliminating all other distractions so that this receives all of the focus that it deserves. In addition to that, we are hiring, a lot.

1 Like

I highly recommend you reach out to Bruce and offer him a position…

14 Likes

Nah, he deserves more than just a job. I say set him up. Offer to buy Rule Machine from him, AND he gets perpetual royalties on top of that. :)

8 Likes

I have reached out and we want to support Bruce deeply as well as others that have contributed so much.

Also, just to state it here. The issues with Rule Machine were OUR platform and our fault, not something with that app. We will be providing some tools soon that will really help developers and users to see where the remaining issues actually are, but in advance of that, I want to apologize for any time that we haven’t closed the loop all the way through to our support or other teams on what is a SmartThings platform issue versus an issue with a specific SmartApp.

There are examples of SmartApps that have problems in the apps themselves which are causing problems for the entire user base. In those cases, we need to provide better tools for those developers as well as isolation of the infrastructure such that they don’t impact other users.

We are in the weeds on this and will get better.

43 Likes

I’ve had that issue since June of 2015… I gave up and removed everything, only to have the $&@%! Siren go off late on Christmas Eve. I had removed it from every smart app and SHM months before. My pleas to support were ignored until long after the logs were overwritten. I only got help after blasting ST repeatedly on the forum and FB. Support ultimately blamed the issue on my wife’s phone…

When this most recent fiasco began, I was pleasantly surprised to reach support quickly for the first time since V2 came out. Support blamed the issues on my light bulbs and wanted me to remove and re-add a bunch of stuff. It turns out that ST was aware of the issue for hours by the time they told me that. Either nobody bothered to inform support or I was lied to.

I’m glad to see ST accepting responsibility, but pardon me if I am skeptical of ST’s sincerity.

2 Likes

@alex myself (and I’m sure many others) really appreciate not only your post but the commitment to transparency. I honestly have only one question to ask. As someone who founded and is so passionate about SmartThings - how did you ever let it get this bad? I don’t mean that with a negative connotation but as a serious question. What trade-offs did you decide to pursue (features, compatibilities, budget, etc) to put first over the reliability and scalability of your ‘cloud’ and why? (Ok, two questions) It’s a loaded question but I’m sure many of us have been wondering what could possibly be more important than the reliability and reputation of your company.

13 Likes

Thanks for the update. The only thing we can do is wait and see. I like ST enough to still have 2 other systems still in unopened boxes. Hope you guys ironed out all the bugs.

@HillbillySUV

We know we have to work hard to earn your trust back and we have a large focus on doing just that. As Alex said, we’re in the weeds of all of this right now, and plan on making meaningful and incremental improvements to the platform until we reach a place of measurable platform stability and reliability.

Support can get busy, and we did have a large bump in ticket counts through December and the first of the year. While I am glad to hear support got to your latest request in a timely manner, we are still growing that team to make it even more effective.

2 Likes

That’s a really thoughtful couple of questions. I may need to do a post soon about just that. I can say that it was a combination of having too many things and opportunities on the plate, and even more so just not recognizing the root level of some of the problems. We have now taken all of these elements to heart: eliminating distractions, scaling the teams, and much more.

I am grounded now in that you all come first, no matter what. We are going to build the best and most open platform in the world over the long haul, and that starts with the basics of working together well right here. In the meantime, I am so thankful to everyone who has supported us through our evolution.

25 Likes

I stopped using SHM when disarming it turned lights on. And there was nothing to account for the lights being turned on. There was no custom rule, no Smart Lighting app, no RM Rule, or anything else to account for this.

Will this change stop this action?

1 Like

I hope we can keep this thread positive. Would be nice if we can accept Alex’s positive outlook and apologies and get SmartThings back to where it belongs. We all want this to succeed. I am thinking this is a good start.

25 Likes

How about you let me return my kit? Then if you’ve got things sorted out in a year or two I can always buy a new one.

I really would like to see the platform succeed, but it’s become something that I did not sign up for.
So how about it? I’ve still got all the original boxes and everything.

1 Like
