First, I want to let you know that everyone at SmartThings is fully aware of the issues that have been affecting platform reliability. In the past few weeks, we’ve redoubled our efforts to make some fundamental improvements that will soon be felt by all of you in your everyday experience with SmartThings. Know that these improvements are only the beginning, that we are in this for the long term, and that we are absolutely committed to building the best, most open platform in the world.
It’s just scratching the surface, but I’d like to take you through the technical details of some of the changes we’ve implemented recently:
Smart Home Monitor
We’ve made a few changes to Smart Home Monitor that should help to resolve issues with data consistency, load time, and Arming/Disarming.
The first is really about returning Smart Home Monitor to normal operations. About a month ago, when we saw that database load was high, we made a change to reduce platform load, accepting lower data consistency as a trade-off. In practical terms, this meant that the platform performed better, but that for some users state information was not accurately reflected in our mobile app. We have now reverted this change, so Smart Home Monitor once again checks against multiple database servers before determining its current state.
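To illustrate the trade-off between reading from one database server and checking several, here is a minimal toy sketch in Python. It is not SmartThings code and does not use the Cassandra API; the `Replica` class, the `read_state` and `majority` helpers, and the timestamps are all hypothetical, and it assumes the latest write reached a majority of replicas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One hypothetical database server's copy of the arm state."""
    value: str
    written_at: int  # logical timestamp of the last write this replica saw


def majority(replica_count):
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def read_state(replicas, consult):
    """Ask the first `consult` replicas and return the newest value seen."""
    if not 1 <= consult <= len(replicas):
        raise ValueError("consult must be between 1 and the number of replicas")
    answers = replicas[:consult]
    return max(answers, key=lambda r: r.written_at).value


# Assumed scenario: the "armed" write reached two of three replicas,
# while the first replica still holds the older "disarmed" value.
replicas = [
    Replica("disarmed", written_at=1),  # stale copy
    Replica("armed", written_at=2),
    Replica("armed", written_at=2),
]

# Cheaper read: one server consulted, so a stale answer is possible.
fast_read = read_state(replicas, consult=1)

# More consistent read: a majority consulted, newest write wins.
checked_read = read_state(replicas, consult=majority(len(replicas)))

print(fast_read, checked_read)
```

In this toy setup the single-server read returns the stale value while the majority read returns the current one, at the cost of contacting more servers on every read.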
The second change we made is to the way Smart Home Monitor subscribes to events. We made the subscriptions leaner between the parent app, child apps, and devices. In short, you should see more reliable Arming/Disarming of Smart Home Monitor.
Database Capacity Increase
We added more nodes to our database clusters, resulting in a 60% increase in overall cluster size for our main database ring. The augmented database should reduce latency and errors across the SmartThings platform. We also set up and deployed a full additional instance of our North American infrastructure to balance customers between instances and create further room for scale.
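As a rough back-of-the-envelope illustration of what a 60% larger cluster can mean per node, the sketch below assumes that total traffic stays constant and is spread evenly across nodes. Neither assumption comes from the figures above, and the function name is hypothetical.

```python
def relative_load_per_node(size_increase):
    """Per-node load after growth, as a fraction of the per-node load before.

    Assumes constant total load that is evenly distributed across all nodes.
    """
    if size_increase <= -1.0:
        raise ValueError("size_increase must be greater than -1.0")
    return 1.0 / (1.0 + size_increase)


# A 60% increase in cluster size, as described above.
share = relative_load_per_node(0.60)
print(f"Each node carries about {share:.1%} of its previous load")
```

Under those assumptions each node would handle a little under two thirds of what it did before; real clusters rarely balance this evenly, so treat the number as an upper bound on the benefit.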
Thrift to CQL Migration
When we first built out our Cassandra infrastructure, we used an API called Thrift. At the time, this was the recommended way to work with Cassandra, but it has since been replaced by CQL as the recommended API. Since Thrift has been deprecated, SmartThings is working to transition our tables to CQL, starting in our highest impact areas. CQL is a better way to use Cassandra that should help to relieve the stress on our database. This change is in progress, but it is a large undertaking; we hope to have it ready within the next couple of months.
New Scheduling Tool
Over the past few weeks, we moved all SmartThings users onto a new version of our scheduler, which we call Ticker. Ticker is a more robust and isolated scheduler than our prior Cassandra-based system, bringing you more reliability than ever with scheduled Routines and SmartApps. Since the move, we have been processing 100% of schedules with extremely low error counts. Additionally, turning off our old system has improved the overall health of our Cassandra data clusters by reducing hotspots on Cassandra database tables. Similar improvements are coming soon for SmartApps that are triggered by events. Check out more details on the new scheduler here.
G1 Garbage Collection
We also enabled the new G1 Garbage Collection algorithm (which is designed for high-memory servers) on our Cassandra cluster servers. The G1 algorithm has received strong JVM community support, and since enabling it, we have seen positive results on the platform.
This is just a partial summary. Many more improvements are coming in what I expect to be a regular drumbeat over the coming days, weeks, and months. I am also committing us to increased transparency, especially as we need your help and feedback. We will provide a weekly update to the community over the coming months so that you can understand what we are doing.
At a macro level, part of our challenge has come from the very thing that is our biggest strength: our commitment to be the most open smart home platform in the world. The very openness that has led to so many of your innovative apps and solutions has also created challenges. It has shown us where our platform architecture needs to mature further to accommodate your innovations while also ensuring world-class reliability and performance for all customers. It has also shown us where we need to provide better developer tools in the near future. These tools will help us to have certainty and take accountability for our platform, while also enabling you to know where errors might be arising within your specific apps and contributions.
Thank you for your patience and continued contributions as we scale. We’ve made improving your service and the fundamentals of our platform our top priority. By doing this and working with you, we’ll make the world smarter, together.
-Alex