Inconsistent operation of SmartThings

Hi all. I wanted to weigh in here and let you all know what has been going on. As Cory mentioned, we have been having some issues with our event queues “backing up”. Thankfully this is not because our infrastructure is overloaded; it is due to issues that we are having with Amazon DynamoDB, the NoSQL data store that we use for events. DynamoDB has been “throttling” our writes to the data store, which in turn causes delays in apps firing, etc.

Our engineering team is actively working with Amazon to understand the root cause of this I/O throttling. As a short-term measure, we’ve raised our I/O limits on the affected data stores. We believe that this will alleviate the issues that we have been seeing until we can get the long-term fix in place. That long-term fix is to move the event feed storage to a different technology, Cassandra, which we already have running in a test environment. Until that is ready for prime time (i.e. fully tested), we are closely monitoring the DynamoDB issue to ensure that this kind of variability in the performance and responsiveness of our platform doesn’t happen again.
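
For anyone curious why throttled writes show up as delays, here is a rough sketch of the effect, not our actual code. It assumes a simple model in which a fixed number of events arrive per tick and the data store accepts only a fixed number of writes per tick; the function name and all of the numbers are made up for illustration.

```python
from collections import deque


def simulate_backlog(arrivals_per_tick, write_capacity_per_tick, ticks):
    """Return the queue depth after each tick for a capacity-limited store.

    Hypothetical model: every value here is illustrative, not a real limit.
    """
    queue = deque()
    depths = []
    next_event_id = 0
    for _ in range(ticks):
        # New events arrive and are queued for writing.
        for _ in range(arrivals_per_tick):
            queue.append(next_event_id)
            next_event_id += 1
        # The store accepts at most its write capacity this tick; anything
        # beyond that is "throttled" and stays in the queue.
        for _ in range(min(write_capacity_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


if __name__ == "__main__":
    # Capacity below the arrival rate: the queue grows every tick.
    print(simulate_backlog(10, 8, 5))   # [2, 4, 6, 8, 10]
    # Capacity raised above the arrival rate: the queue stays empty.
    print(simulate_backlog(10, 12, 5))  # [0, 0, 0, 0, 0]
```

In this toy model the backlog grows whenever the write limit sits below the arrival rate, and it stops growing once the limit is raised above that rate, which is the idea behind the short-term measure.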

It’s especially frustrating for our users when the platform doesn’t respond like it should, and it’s equally frustrating for the engineering team to have a third-party service like DynamoDB be the cause of the issue when our own cloud servers are running very “cool” in terms of CPU utilization. We’ve put a lot of energy into building a platform that will scale to many millions of users, and I want to assure all of our customers that when things like this happen, there is an entire team of people who are “swarming” to resolve them as fast as humanly possible.

Regards,

Jeff Hagins
Founder & CTO
SmartThings