This site must be run by Democrats… when they don’t like what you are saying they tell you to shut up, change the subject and try to pretend it never happened by sticking their fingers in their ears and shouting LA-LA-LA-LA…
Why else would you close the original topic labeled “ST CLOUD DOWN” unless you wanted to hide it from the front page?
Good Game “brand preservation” team. If only the product team did their job better you wouldn’t have had to do this and we would have no subject asking if the cloud was down?! I should make a bot to make a new post every day with that subject
Hey my automation works great, even when the power is out. Just got a notification that ‘things have quieted down for 30 minutes’. Well it has been really quiet since Thursday at 11 pm when the power went out! There must be a check at 8 am, because I got the same message yesterday morning.
I will say this… I was on the verge of releasing a new Handler for LANNouncer. But the lack of stability and reliability, especially for phones as Presence Sensors, SHM and Echo, has eaten the time. Triggers not firing and conditions of routines changing on their own… I’m not quite at the point of abandoning ST, but I’m close to following @bravenel Bruce off the island.
It also doesn’t help my attitude that the ST staff silently deleted, without comment, my App Submission for LANNouncer after six months without ever even starting the review process. @alex may claim to be supporting us, but that was a bit of a slap in the face.
If the hub has been operating locally and then reconnects to your cloud account, any notifications from the time when you were not connected to the cloud account will then get sent at once.
This can result in some fairly strange impressions.
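To illustrate, here is a minimal Python sketch of that queue-and-flush behavior. It is not SmartThings code: `NotificationBuffer`, `send`, and `max_age` are hypothetical names, and the age filter on reconnect is only one assumed way a platform could avoid delivering stale messages.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class NotificationBuffer:
    """Hypothetical hub-side buffer; an illustration, not the real hub logic."""
    send: Callable[[str], None]  # delivers one message through the cloud
    online: bool = True
    pending: List[Tuple[float, str]] = field(default_factory=list)

    def notify(self, created_at: float, message: str) -> None:
        if self.online:
            self.send(message)
        else:
            # No cloud connection: hold the message with its original timestamp.
            self.pending.append((created_at, message))

    def reconnect(self, now: float, max_age: float) -> int:
        """Flush the backlog, skipping messages older than max_age seconds.

        Returns the number of stale messages that were dropped.
        """
        self.online = True
        dropped = 0
        for created_at, message in self.pending:
            if now - created_at <= max_age:
                self.send(message)
            else:
                dropped += 1
        self.pending.clear()
        return dropped
```

Without the age check, everything in `pending` goes out in one burst on reconnect, which is the behavior described above.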
My power just came back an hour ago with this message:
Your SmartThings Hub at {{ location.name }} is now active
Apparently, it’s been so long since the power went out that it forgot my location already. And no, I don’t have miracle batteries. Actually I have no batteries in the hub.
@JDRoberts they must have a checkpoint at 8 am, because I’ve gotten the same message that things have quieted down at the same time two days running.
The biggest disadvantage of having an extended power outage is that it drains all of the batteries. I have lost between 30 and 50% of my batteries in 2 days. I noticed that many of my devices had LED lights on during the outage. Perhaps they panicked at the loss of the connection to the controller.
I have seen nothing from ST on what caused the incident. Make no mistake. You knock out this many people and it’s not an outage, it’s an “Incident”.
I would feel much better knowing what caused it and what they plan for remediation. The order of the day is stone silence.
S.T. … Trust that your self-installer market isn’t so fickle that it would abandon you if you were honest, forthright and forthcoming. Many of us understand how this stuff works and would like to know what/why. Leave it to us to explain to those who don’t.
tgauchat
(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy)
Definitely waiting for…
Post-mortem detailed explanation of the cause of the outage and the cause of its duration and the plan to prevent recurrence.
I’ve been around here long enough not to expect much. Sometimes they do share some technical details, sometimes they just wave hands and brush it off. Truth is, these “incidents” have been happening quite regularly and will continue to happen. Several ST employees admitted that the platform is not designed to scale gracefully and that management is not up to the task of fixing it. They’re just patching it here and there to limp along till the next “crisis”.
P.S. One of the best explanations I’ve heard from one of ST staffers was “cloud flatulence”. I guess they’ve been feeding it too much chili.
At 3:30 our monitoring tools alerted us that our API cluster went down. This is the part of the platform
that serves graph.api.smartthings.com & mobile devices. This is when consumers would have begun to notice the crash.
At this time our device cluster also began to struggle and monitoring tools alerted a spike in database connections.
Engineering identified our caching layer as the source of the increased load on our databases. Operations
on the caching layer began to fail, which pushed an overwhelming amount of traffic to our databases. This
resulted in many operations timing out and decreased throughput.
Upon further investigation, engineering identified a pattern: After a Cache server threw a certain exception,
all existing operations were cancelled and the server would be marked as dead. Connected API nodes would then
connect to a different Cache server. These crashes happened across the caching layer quickly - engineering began
rolling deploys across the caching cluster.
Engineering then identified the root cause of the issue:
We hit a value limit in our cache servers for a specific type of object. When our API cluster attempted
to save this object in the cache an exception occurred and the Cache client marked the node as down, cancelling all
existing operations on that server.
As the build was running for the code change to protect against reaching this threshold,
hubs began to go offline (a consequence of an unstable caching layer plus increased load on our databases, which slowed queue processing).
We validated the code change across lower environments and pushed to Production. At this time hubs began to report online
as queue processing returned to a normal level. Platform performance returned to normal as the queue backlog was processed.
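For readers who want to picture the kind of guard described above, here is a minimal Python sketch. It assumes the “value limit” was a per-value size cap, which the write-up does not confirm. `SizeGuardedCache`, `MAX_VALUE_BYTES`, and the dict standing in for the cache client are all hypothetical and not the actual SmartThings change.

```python
import pickle
from typing import Any, Dict

# Assumed limit for illustration only; the real threshold was not stated.
MAX_VALUE_BYTES = 1024 * 1024


class SizeGuardedCache:
    """Wraps a cache backend and refuses values above a size limit."""

    def __init__(self, backend: Dict[str, bytes], max_bytes: int = MAX_VALUE_BYTES):
        self.backend = backend  # stand-in for a real cache client
        self.max_bytes = max_bytes
        self.skipped = 0  # count of writes refused by the guard

    def set(self, key: str, value: Any) -> bool:
        payload = pickle.dumps(value)
        if len(payload) > self.max_bytes:
            # Skip the write so the cache server never sees the oversized
            # value and the client has no exception to react to.
            self.skipped += 1
            return False
        self.backend[key] = payload
        return True

    def get(self, key: str) -> Any:
        payload = self.backend.get(key)
        return None if payload is None else pickle.loads(payload)
```

A caller treats a `False` return from `set` as a cache miss on later reads and falls back to the database for that one object, so a single oversized value no longer takes a cache node out of rotation.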
Thanks for bringing up transparency - hopefully this explanation is satisfactory (there was more mitigation/behind the scenes stuff that I didn’t mention).
If @Vlad is related to The Impaler, then he gets a round of applause from @ady624 and me…
Thanks for the update!
tgauchat
(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy)
Wow… Sounds like quite the domino effect!
Thanks for the details; it certainly explains the time to resolve.
I sure do wish the root cause could have been prevented or proactively detected, or that something could be done about the operating environment to keep a similar incident more contained.
Some members of the community know what that object was. They received the email telling them to update the DTH.
Now on the matter of ST cloud and reliability in comparison to other platforms. ST is my fourth entry into HA. Started off with simple TCP connected lights. While extremely simple, also very limited and unreliable even over LAN or using the TCP remote.
On to Wink. Okay, wide range of integration with other systems, but if you think ST has cloud issues, you have obviously never had a Wink hub or link. Even when it did work there was a minimum 3 second, usually 10 second, delay from trigger to action. It did not take long (okay, about 6 months) for me to get tired of opening the front door and getting all the way through the house into the kitchen before the front hall light came on, and to decide it was NOT for me. The Wink app had (has) integration with other systems, but the number of devices it actually supports itself was extremely limited. It took me 2 months working with Wink techs to even get Schlage Z-wave locks added to supported devices. Even then it was only to show lock/unlock status; to my knowledge you still cannot actually program the locks through Wink. Cloud connection loss was always an issue with my internet, never their cloud. At least ST will openly (even if delayed) admit when they are having an issue.
(I am not even going to count the experiment into Wemo that I got just because they worked with Echo before ST.)
On to Securifi Almond+. Another system that looks great on paper. A nice system if you have limited wants (I won’t say needs because 99.9% of us don’t “need” any of this). Big advantage: everything is run locally. The bad thing about that is no integration with other things (okay, limited Hue support). They have now added a rule machine to be able to automate more things, but it was primarily a remote control with a central control panel last year.
They just now added Echo support, but after 6 months of beta testing it can only be used to change modes from away to home or vice versa. Sorry, unlocking the door should change the mode from away to home; I don’t need to announce “Echo, change mode to home” when I arrive. No actual device control through Echo. Other than, again, extremely limited device support, there is an (unpublished and often denied) hard limit of 75 devices (combination of all IP, ZW and Zigbee) before the hub starts overloading, overheating and constantly shutting down. I was at 40ish IP devices and 35ish ZW/ZB when I started having problems. Once I removed all ZW/ZB and only had the 40ish IP devices it was stable. Now that I am up to between 70 and 80 IP devices, using it solely as a router, I am back to having it constantly reboot and/or drop WAN access, forcing me to go in over SSH and clear out packets every couple of weeks or every month, depending on how many times devices have disconnected/reconnected to it.
Okay, on to ST 9 months ago with the release of the V2 hub. Yes, I got in on the upgrade deal because somebody mentioned it in the Securifi forum. I had looked at ST before but the cost and limited number of sensors on the ST site scared me away. This was by far the best or worst decision I ever made. Yes, there are bugs, but most of the time things just work. Yes, I have 1 Lowe’s Iris motion sensor that seems to constantly lose its connection. Yes, not every bulb always shows its proper state. Yes, when the internet goes down nothing works. However, 99+% of the time it works as it should. The ability to add devices from (almost) any supplier has been a blessing and a curse. Do I wish I could add my Nest Protect smoke alarms? Of course I do. Do I wish I could (more easily) use my IP and CCTV cameras as triggers? Definitely. I know I could do a lot more than I am if I just took the time to sit down and map everything out. It’s “easier” to just keep adding another new SmartLighting “rule” when I add new devices than it is to sit down for a day and actually redo everything to organize it.
According to the API I am up to 132 devices, a 400% jump in the last 9 months since I got ST, and I have at least 20 more sitting in boxes waiting to be installed. If TCP had issues with 12 light bulbs, Wink couldn’t handle 20 lights, locks and PIR, and Almond+ got confused at 35, what would they do closing in on 150?
I don’t know of any other system that will let me grab any device from any manufacturer and within a couple of hours or days have it integrated. Look at the Bloomsky: they announce a sky cam and within a couple of days the forum has it integrated (something Bloomsky said they had tried and failed to accomplish on their own) so it can be used to detect light, moisture and temp to trigger anything else in your system. (Yes, I was one of the victims of SHM using “all moisture sensors”: it rained at 2 am the first night Bloomsky was up, so I was woken up to search around the house trying to find the leak.)
As far as cloud issues, I can tell every time Amazon has had Echoes on sale without ever even looking at the Amazon website. 3-5 days after the sale the failure rate of Alexa goes up about 500%. Okay, if I am in charge of the Alexa cloud and I know they are going on sale and we sold 5000 of them (pulling # out of my @SS) Wednesday and they will all be getting activated Fri, Sat, Sun, I would sure as Hell make sure I was ready for the increased load. I would not wait until Monday or Tuesday and look at all the failures in the log and then decide we needed to up our capacity. That is not how they do it though. So if Amazon can’t fix their cloud to have 100% reliability, how can we expect better out of ST? (I don’t care if ST was bought out by Samsung.)