SmartThings Community

SmartThings down again [30 August 2018]


(Matt) #1

Well it is down yet again.


#2

Yep, me too.


#3

The only way I can turn things on or off is via Alexa.


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy) #4

100% operational… So far. Fake news :stuck_out_tongue_winking_eye:.
(Also not working for me, despite Status).

https://status.smartthings.com


#5

This was a very short outage for me. About 10 minutes or so.


(vlad) #6

Thanks for calling this out - support just notified engineering about potential issues this evening after an agent saw this post. We did some digging and found the likely cause of the issues. If you would like, DM me your ST username and I can confirm whether you were one of the users affected by this incident.

At 21:02 (central time) a node in a database cluster used for processing device events failed. This happens from time to time and in normal circumstances shouldn’t have any effect on the platform, but it seems to have failed in a way that caused the cluster to keep routing queries to it. This caused DTH executions to fail, and since DTHs are the way that SmartApps communicate with your physical devices, it would have been very disruptive if it affected your system.

Timeline was roughly (central time):
20:55: Performance degradation begins
21:02: Node stops responding to health checks
21:05: Subsequent failed health checks trigger node removal from cluster
21:10: Cluster recovered
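
For readers curious how the 21:02 to 21:05 steps fit together, here is a minimal sketch of health-check-based eviction. It is not SmartThings' actual implementation: the `Cluster` class, the node names, and the threshold of three consecutive failures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster that evicts a node after N consecutive failed health checks."""

    max_failures: int = 3  # assumed threshold, not taken from the post
    nodes: set = field(default_factory=set)
    failures: dict = field(default_factory=dict)

    def add(self, node: str) -> None:
        """Register a node and start it with a clean failure count."""
        self.nodes.add(node)
        self.failures[node] = 0

    def record_check(self, node: str, healthy: bool) -> None:
        """Record one health-check result and evict the node if it keeps failing."""
        if node not in self.nodes:
            return
        if healthy:
            self.failures[node] = 0  # any success resets the streak
            return
        self.failures[node] += 1
        if self.failures[node] >= self.max_failures:
            self.nodes.discard(node)  # stop routing queries to this node

    def routable(self) -> list:
        """Nodes that may still receive queries."""
        return sorted(self.nodes)


cluster = Cluster(max_failures=3)
cluster.add("node-a")
cluster.add("node-b")
for _ in range(3):
    cluster.record_check("node-b", healthy=False)
print(cluster.routable())
```

Until the threshold is reached the failing node stays routable, which is the window in which queries can still be sent to it.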

This occurred only in the na01 shard (https://graph.api.smartthings.com)

There’s definitely room for improvement here with regard to our response - after digging into the metrics, there were red flags that should’ve caused our automated alerts to fire differently. Since a node crashing in a cluster is expected over time and shouldn’t cause any issues, it didn’t raise any concerns for the ops team. Next step is to investigate why additional alerts didn’t fire in this case. For example, we have alerts on deviation of the number of events processed per minute, which I’m surprised didn’t fire. Those types of alerts might have triggered a status page update or a more active response from us.
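
As a rough illustration of an alert on deviation in events processed per minute, the sketch below flags a reading that sits more than a set number of standard deviations from a recent baseline. The function name `deviates`, the threshold of 3.0, and the sample numbers are all hypothetical; the post does not say how the real alerts are computed.

```python
from statistics import mean, stdev


def deviates(baseline, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) / sigma > threshold


# Hypothetical events-per-minute counts from the preceding minutes.
recent = [1000, 1020, 980, 1010, 990]
print(deviates(recent, 1005))  # within normal variation
print(deviates(recent, 400))   # sharp drop, as in a processing outage
```

A check like this only fires if the baseline window and threshold suit the traffic pattern, which may be part of why an alert can stay quiet during a short incident.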

cc @tgauchat @allison


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy) #7

Super appreciate the detailed explanation, @vlad! Thank you!

I’m on NA01.

Indeed, to me, the outage appeared to be about 15 minutes. I’m on Hub V1, so no local execution. Motion sensor failed to turn on a light. Next an Alexa command failed.

By the time I posted, a couple things were already “working”, but seemed delayed.

All this is consistent with your timeline.