Just wanted to stop in and give you a heads-up about a database change happening tomorrow, 1/22/16.
We are upgrading and adding more database clusters to our environment to alleviate some of the “pressure” we are having on our current database clusters.
Unfortunately, the current data is not easily “migratable,” so when we make this change your event history for the last seven days will not be available. This event history will begin to build again once the migration is complete. This change should not affect control of your devices or any automations you have set up. You can expect this change to happen around 10am PST on 1/22/16.
The goal of this change is to help with a lot of the platform instability you have seen the last couple of days. We are confident this will resolve a lot of the problems you may be seeing.
We apologize for any inconvenience this might cause for you. Please reach out to me, @slagle, if you have any questions or concerns. We thank you in advance for any patience you can give us.
I agree. I’m starting to see some positive changes happen finally. I notice a pretty large performance increase in device response since yesterday’s update. I realize things have been rocky this last week but for the most part everything has been working for my home and I have just over 100 devices.
So for once I would like to give SmartThings some positive posts and props on definitely seeing improvements to the platform. Fingers crossed that the update goes off without a hitch today.
Although I’m happy to see ST has found the cause and is addressing it, the IT Director side of me is a bit disappointed. Server loads, whether it is processor, memory, network, or app-specific utilization, are something that can easily be monitored and trended. (Is that even a word?) Although I can’t say for sure, since I’m not sitting in their meetings and I tell my team “I want facts, not feelings,” it “feels” like ST tends to be more reactive than proactive when it comes to back-end infrastructure. I was hoping that the Samsung purchase would have changed that and given ST the resources to quickly scale both strategic and on-demand infrastructure. Perhaps there are other issues in the relationship here that are afoot. (My Sherlock word for the day.)
So, again, I’m very happy to see progress on this issue, but I’m a bit saddened that issues have to become massive before a change is made.
I don’t think they have started to do the database upgrade yet, so any performance improvement is not a result of that. Later on today will be the real test. I do think Friday afternoon is a crazy time to be doing these upgrades.
I have the same thoughts myself (granted, I’m not a Director). My inability to ignore these technical issues would have caused me to fix this stuff months ago, but then again I don’t have the giant Samsung standing over my shoulder telling me what my agenda will be. Either way, things can’t get much worse short of a total platform outage; it can only get better from here.
Maybe not to a fully functional system, but that’s not quite what we have here. If they don’t do anything, the platform’s hosed for another three days until Monday. I applaud their willingness to abandon read-only Friday! =)
The engineer in me is dying to know the specifics of what’s going on in the infra. I’ve never been through an ‘upgrading and adding more database clusters’ operation where there was data loss. Indeed, the entire point of a cluster is that you can scale it horizontally without having to introduce downtime or data loss.
From what I can sleuth out, ST uses MySQL and Cassandra. Cassandra, being a NoSQL database, is likely the datastore responsible for keeping our event data. Cassandra was architected from the ground up to be easily scaled, so I wonder what they’re doing to the cluster that is making data loss a possibility.
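For anyone wondering why adding nodes normally shouldn’t cost any data, here’s a toy sketch of a consistent-hash token ring, which is the general idea behind how Cassandra-style clusters spread rows across nodes. To be clear, this is not ST’s setup and not Cassandra’s actual code: the `Ring` class, the node names, the `event-N` keys, and the use of MD5 for tokens are all made up for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (illustrative hash choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring with virtual nodes (hypothetical, not Cassandra's API)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted list of (token, node_name)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node claims several positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._tokens, (token(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        # A key belongs to the first token at or after its own position,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._tokens, (token(key),))
        return self._tokens[idx % len(self._tokens)][1]


keys = [f"event-{i}" for i in range(10_000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")  # scale out by one node
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of keys changed owner")
print("all moved keys went to the new node:",
      all(after[k] == "node-d" for k in moved))
```

The point of the sketch: when a node joins, only the keys that now fall under the new node’s tokens change owner, and everything else stays put. In a real cluster those ranges get streamed over to the new node, so nothing has to be thrown away just to add capacity.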
To be clear, I’m not criticizing. I couldn’t care less about that data, and a trade for stability is one I’d make every time. Something tells me that this is more than just scaling out the database layer though.
@slagle’s post about ‘learning a lot’ from this last issue tells me that they identified something big, and are making moves to fix it. When the dust settles, I’d love to see a post-mortem about what all was done to fix things. Curious geeks want to know!!!