Post Mortem From 6/15/2016 Outage
Notes From Cloud Engineer @vlad
At 3:30 our monitoring tools alerted us that our API cluster had gone down. This is the part of the platform
that serves graph.api.smartthings.com & mobile devices, so this is when consumers would have begun to notice the outage.
At this time our device cluster also began to struggle, and monitoring tools alerted on a spike in database connections.
Engineering identified our caching layer as the source of the increased load on our databases. Operations
on the caching layer began to fail, which pushed an overwhelming amount of traffic to our databases. This
resulted in many operations timing out and in a drop in overall throughput.
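For context, here is a rough sketch of why a failing cache layer translates directly into database load (the CacheClient/Database interfaces here are hypothetical, not our actual code): every cache read that errors or misses falls through to a database query, so when cache operations start failing, the database absorbs the traffic the cache normally serves.

```java
// Minimal cache-aside read sketch (hypothetical interfaces, not our production code).
// When the cache errors or misses, the request falls through to the database.
import java.util.Optional;

public class DeviceStateReader {

    interface CacheClient {
        Optional<String> get(String key) throws CacheUnavailableException;
        void set(String key, String value) throws CacheUnavailableException;
    }

    interface Database {
        String queryDeviceState(String deviceId); // comparatively expensive
    }

    static class CacheUnavailableException extends Exception {}

    private final CacheClient cache;
    private final Database db;

    DeviceStateReader(CacheClient cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    public String read(String deviceId) {
        String key = "device-state:" + deviceId;
        try {
            Optional<String> cached = cache.get(key);
            if (cached.isPresent()) {
                return cached.get();                 // fast path: served from cache
            }
        } catch (CacheUnavailableException e) {
            // Cache node is down or erroring: we cannot short-circuit here,
            // so this request (and every one like it) lands on the database.
        }
        String state = db.queryDeviceState(deviceId); // slow path: database query
        try {
            cache.set(key, state);                    // best-effort repopulation
        } catch (CacheUnavailableException ignored) {
            // If the cache is still unhealthy, the next read falls through again.
        }
        return state;
    }
}
```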
Upon further investigation, engineering identified a pattern: after a cache server threw a certain exception,
all of its in-flight operations were cancelled and the server was marked as dead. Connected API nodes would then
connect to a different cache server. These crashes spread quickly across the caching layer, so engineering began
rolling deploys across the caching cluster.
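Roughly, the pattern looked like the sketch below (hypothetical types, not the real cache client): a single fatal exception on a node cancels that node's in-flight operations, marks it dead, and shifts its keys onto the remaining nodes, which then hit the same exception and die in turn.

```java
// Illustrative sketch of the failure pattern described above (hypothetical types).
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Future;

public class FailoverCacheClient {

    static class CacheNode {
        final String host;
        volatile boolean dead = false;
        final List<Future<?>> inFlight = new CopyOnWriteArrayList<>();
        CacheNode(String host) { this.host = host; }
    }

    private final List<CacheNode> nodes;

    FailoverCacheClient(List<CacheNode> nodes) { this.nodes = nodes; }

    /** Called when a node throws the fatal exception (e.g. the value-limit error). */
    void onFatalError(CacheNode node) {
        for (Future<?> op : node.inFlight) {
            op.cancel(true);   // every pending operation on that node is cancelled
        }
        node.dead = true;      // node is marked dead and dropped from rotation
    }

    /** Subsequent requests hash onto the remaining live nodes. */
    CacheNode pickNode(String key) {
        List<CacheNode> live = nodes.stream().filter(n -> !n.dead).toList();
        if (live.isEmpty()) {
            throw new IllegalStateException("all cache nodes marked dead");
        }
        return live.get(Math.floorMod(key.hashCode(), live.size()));
    }
}
```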
Engineering then identified the root cause of the issue:
We hit a value limit in our cache servers for a specific type of object. When our API cluster attempted
to save this object in the cache, an exception occurred and the cache client marked the node as down, cancelling all
existing operations on that server.
While the build for the code change to protect against reaching this threshold was running,
hubs began to go offline (a consequence of the unstable caching layer plus the increased load on our databases, which slowed queue processing).
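A minimal sketch of the kind of guard involved (the limit value and names here are assumptions, not the actual change): check the serialized size of the object before writing it, and skip the cache write rather than let an over-limit value trigger the exception that takes the node down.

```java
// Sketch of a size guard before a cache write (assumed limit and names).
import java.nio.charset.StandardCharsets;

public class SizeGuardedCacheWriter {

    // Assumed server-side value limit; memcached-style caches commonly cap items at 1 MB.
    private static final int MAX_VALUE_BYTES = 1 * 1024 * 1024;

    interface CacheClient {
        void set(String key, byte[] value);
    }

    private final CacheClient cache;

    SizeGuardedCacheWriter(CacheClient cache) { this.cache = cache; }

    /** Returns true if the value was cached, false if it was skipped for being too large. */
    public boolean safeSet(String key, String serializedObject) {
        byte[] payload = serializedObject.getBytes(StandardCharsets.UTF_8);
        if (payload.length > MAX_VALUE_BYTES) {
            // Too large for the cache server: skip the write (and log/metric it)
            // instead of triggering the exception that marks the node as down.
            return false;
        }
        cache.set(key, payload);
        return true;
    }
}
```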
We validated the code change across lower environments and pushed it to production. At this point hubs began to report online
as queue processing returned to a normal level, and platform performance returned to normal once the queue backlog was processed.
Thanks for bringing up transparency - hopefully this explanation is satisfactory (there was more mitigation and behind-the-scenes work that I didn’t mention).