Apologies for missing last week’s post while I was traveling. This week’s post will include details from both this week and last week.
First, I want to talk about the downtime we had on Tuesday, June 14th. It was caused by our weekly SmartApp and Device Handler deploy. These deploys are routine and normally go without incident, but last week’s deploy pushed us past a limit in our caching layer, which caused the login and device control problems some of you were seeing. The caching client we use has a limit of 1MB, and the total size of our approved Device Handlers grew larger than 1MB, causing our platform to behave erratically. We recognized the problem right away, created a hotfix, and deployed it as quickly as possible.
Although this was unfortunate, the upside is we learned from it and put in safeguards to make sure we don’t see this happen again. We appreciate your patience during our brief downtime.
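For readers curious what a safeguard of this kind can look like, here is a minimal sketch in Python. It is not our actual fix. The function names, the JSON serialization, and the exact byte limit are all assumptions made for illustration. The idea is simply to measure a value before it is written to the cache and fail loudly at deploy time instead of letting an oversized write cause trouble later.

```python
import json

# Assumed per-value limit of the caching client (1MB, as described above).
MAX_CACHE_VALUE_BYTES = 1024 * 1024


class CacheValueTooLarge(Exception):
    """Raised when a value would exceed the cache client's size limit."""


def serialized_size(value) -> int:
    """Return the size in bytes of the value once serialized as UTF-8 JSON."""
    return len(json.dumps(value).encode("utf-8"))


def check_fits_in_cache(key: str, value, limit: int = MAX_CACHE_VALUE_BYTES) -> int:
    """Return the serialized size, or raise if the value is over the limit.

    Hypothetical helper: call it before writing to the cache so an
    oversized payload is caught before it reaches production traffic.
    """
    size = serialized_size(value)
    if size > limit:
        raise CacheValueTooLarge(
            f"value for {key!r} is {size} bytes, limit is {limit} bytes"
        )
    return size
```

A deploy script could call `check_fits_in_cache` on the combined Device Handler payload and stop the rollout if it raises.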
This week we have also seen a few production issues that have impacted a subset of our customers. Our API cluster is seeing periods with an abnormally high number of database connections. These spikes started after this week’s platform release and, while they seem to happen at random, we are looking into every change that went out in the deploy to find the root cause. We will update you with more information when we have it.
Device Handler Improvements
As part of the push for improvement on the basics, we’ve begun to turn to nuanced problems specific to individual third-party device integrations. Here are a couple of examples where we made strong progress in the past week.
Netatmo Updates
We found and resolved a few problems with the Netatmo integration this past week. The Netatmo Device Handler and Service Manager were throwing approximately 100k backend errors per day, and we were able to cut that down to nearly zero. Not only does this help the platform, but it also improves the entire Netatmo integration experience.
Iris Smart Plug
We worked with @blebson to resolve another Device Handler with a high error count. A Null Pointer Exception was being thrown in some cases, creating 250k errors over a 24-hour period. We worked with @blebson and the users of his Device Handler, and within a week ⅓ of Iris Smart Plug users had updated to the revised version, which has reduced those errors significantly.
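The details of the fix aren't covered here, but errors like this usually come from using a value that turned out to be missing. As a generic illustration only, here is a small Python sketch of the defensive pattern. The function name and the `power` field are hypothetical and are not taken from the actual Device Handler.

```python
from typing import Optional


def parse_power_watts(report: Optional[dict]) -> Optional[float]:
    """Return the power reading in watts, or None if the report has none.

    Hypothetical example: `report` stands in for a parsed device message,
    and "power" is an assumed field name.
    """
    if report is None:
        # No report at all: return None instead of raising.
        return None
    raw = report.get("power")
    if raw is None:
        # The field is missing from this report.
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        # The field is present but not a usable number.
        return None
```

The caller can then skip the event when the result is `None`, instead of generating a backend error for every incomplete report.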
Documentation
We recently reviewed the best practices we have established internally for writing code on our own platform and found that many of them weren’t clearly reflected in our external docs. As a result, we have released new documentation that outlines the best practices for writing code on the SmartThings platform, which is available to everyone. Let us know how we could make our docs even better.
Finally, I want to point out that we love hearing stories like the one below shared by @sgnihttrams. It is the reason the founding team and I started SmartThings and the reason we continue to drive to be the greatest Smart Home platform in the world. This story makes it even clearer in my mind why the basics are so important. Thanks for sharing, @sgnihttrams!
See you all next week with more data and improvements!
-Alex