The first few we've spot-checked are database-related event save failures. In CoRE's case it's happening when it calls (either in timeHandler or recoveryHandler):
sendLocationEvent
I don't think the failures are limited to the scheduler at this point.
It may not be limited to the scheduler part of the platform, but from the customer's point of view it's the same end result: a routine/smartapp that was scheduled to run, didn't.
@vlad get ready for a long list of stuff that happened:
This morning, at 7:00 am, my Good Morning routine was supposed to happen. My notifications show "Good morning" but none of the light changes that are supposed to take place with it happened. It did not set my alarm to "unarmed" and it did not change my mode to "Home" from "Night."
at 7:35am~ish, kitchen door opens, sets off alarm since alarm still set.
at 7:40 I manually clear the alarm.
at 7:40 also, I manually click routine "Good Morning." Nothing happens. No lights change. Cannot change alarm mode.
at 8:55 kitchen door intrusion detected since alarm still set. Cannot disable alarm or dismiss the alert because the page doesn't even load on my Android SmartThings app.
at 9:55 I get a reminder that there's been an intrusion. Still can't clear it, page won't load.
at 10:55 I get a reminder that there's been an intrusion. Still can't clear it, page won't load.
at 11:55 I get a reminder that there's been an intrusion. Still can't clear it, page won't load.
at 12:55 I get a reminder that there's been an intrusion. Page loads, get an error trying to clear the alarm. Trying to refresh the page to see if alarm is cleared, aaaaand page won't load again.
At this point, it's unreliable enough that my friends who I talked into getting SmartThings are doubting the usefulness of this product.
It somewhat amazes me that we find new ways to break scheduling and that things like this aren't monitored. It's not always just the health of the machines/OS that matters, but also the details in the DB.
I would have thought a growing number of failures would instantly set off some red flags.
At what point does this become proactive vs. reactive, when people are already annoyed and complaining?
Yesterday (October 13) a SmartApp which had been running for 2 years broke down.
I tried to execute it from the IDE, but got obscure error messages (2 different).
I contacted support, with a strong suspicion of some SmartThings cloud overload, but once they saw "custom SmartApp", they declared it was not their problem!
I posted my problem in this other thread: Cassandra timeout during read query?, since one of the 2 error codes I got was a "Cassandra timeout".
Interestingly (?!!), up to now, this periodic SmartApp would often fail to schedule (it runs every 5 days), but when it did schedule, it executed properly. Now it schedules... and aborts!
Not sure I would label that as "progress"...
Now my system didn't go to "home" when I arrived home. I left, it went to away and armed itself with SHM. Got back, opened the garage door, boom, intrusion detected. Thanks ST... basic geolocation and firing of routines not working. It had been working basically forever...
Investigating - Some North American users may be experiencing issues with loading resources in the mobile app and web UI, arming/disarming Smart Home Monitor, and the execution of SmartApps. The engineering team is working on the issue and we will provide an update shortly.
Oct 14, 14:19 EDT
If so, can someone articulate what is causing this?
Really annoying... this morning...
Mode and/or SHM status did not change when the CoRE piston executed. Re-running the piston several times did not fix it.
The app displayed a notification that the hub went offline. If it did, the cause was the ST hub or the cloud, not my internet. Amazingly, the app showed the hub was online (live status) and turning on a light with the app worked.
Lights took a very very very long time to turn on in response to motion (smart lighting) so I manually rebooted the hub.
Sadly, this ended a good 75 day run I had with basically no issues.
This establishes again that ST can't be relied upon for important functions. It's a huge reminder that it's entirely possible the system will not inform me when an intrusion occurs even if every other required support system is up and running (power, internet, etc.), and perhaps even worse, it could set off a false intrusion and siren, etc. and disturb or scare the family.
Same here, @JH1. I have seemingly been very stable, and while other people were complaining my setup has been very reliable since finding the Cree/Osram bulb bug. Switching most of my bulbs to the Hue Bridge has made my home dang near 100% for months now, easily 3 months. The last few days things have been super slow, aka Hue delay. And then starting last night, tons of failures: routines not firing, stock basic automations not working or only partially working. Yet the status page is "investigating". I love the spin on it: don't say we have a problem... Hell, the IDE has been unusable for me... Can't view Hub, events, and other parts of the page just give 500s. But hey, let's investigate.
While I am both curious about and of course annoyed by additional outages, to put it in perspective, if all I endure is what happened this morning I will survive, relatively speaking, compared to some of the outages and degradation I have been through.
Investigating is good, acknowledging there is an issue and seeking to understand cause is good. Let's see if discovered cause is communicated clearly and if the issue remains on status until truly resolved...
About a week ago my mode didn't change once... and I believe it was ST. That's the only problem I've had in weeks (months maybe?). I have a thermostat that was acting wonky but I'm pretty sure it wasn't ST because my other two were fine.
Update - While the connectivity and SmartApp execution issues have improved, we are continuing to investigate additional performance improvements.
Oct 14, 22:30 EDT
Okay, can someone share the information that backs up this statement?
Yea - one of the issues was "fixed" by replacing 3 nodes in our events cluster for na01. These are read timeouts from that cluster. Reducing this should have helped with the UI, IDE, smartapp executions, etc...
It doesn't look like the replacement fixed the issue here, as the timeouts are still elevated and follow a cyclical pattern. (We can see other Cassandra metrics trending upwards again.) This mainly affects execution, and the chance of it occurring increases in proportion to the rate of created events per app execution. So while many smartapps/devices are firing fine, certain ones are hit more often (the ones that create more events). We have a change that can go out to make the event creation async, which would help with executions, but that could have a number of unintended side effects and would change the behavior of the core part of the system. It would be much safer to figure out what happened in the last couple of days to cause these spikes.
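To put rough numbers on why the apps that create more events get hit more often, here is a minimal sketch. It assumes each event save times out independently with the same probability and that a single timeout aborts the whole execution. Neither assumption is confirmed in this thread, and the function name and the 1% rate are hypothetical, not measured values from the na01 cluster.

```python
def execution_failure_probability(p_timeout: float, events_per_execution: int) -> float:
    """Chance that at least one of n synchronous event saves times out.

    Assumes independent saves with an identical timeout probability,
    and that one failed save aborts the execution (illustrative model only).
    """
    if not 0.0 <= p_timeout <= 1.0:
        raise ValueError("p_timeout must be between 0 and 1")
    if events_per_execution < 0:
        raise ValueError("events_per_execution must be non-negative")
    # P(at least one failure) = 1 - P(every save succeeds)
    return 1.0 - (1.0 - p_timeout) ** events_per_execution


if __name__ == "__main__":
    p = 0.01  # hypothetical per-save timeout rate, not a measured value
    for n in (1, 5, 20, 50):
        chance = execution_failure_probability(p, n)
        print(f"{n:>3} events per execution -> {chance:.1%} chance of failure")
```

For small timeout rates the result is close to `events_per_execution * p_timeout`, which matches the "in proportion to" description above. Under this model an app that creates many events per run fails noticeably more often than one that creates a single event.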
bamarayne
(Jason "The Enabler" as deemed so by @Smart)
I haven't seen a lot of problems.
The app has been slow. I've gotten a few timeouts where it was unable to save pages. Other than that the app has been ok.
The IDE has been just fine. I've been in it literally all day and night working on a project. No issues there at all.
My system routine to change the mode to night mode at 2030 tonight failed to change the mode. My mode manager piston detected it and corrected the mode within a couple of minutes. I only knew it happened because the mode manager sends me a message.
Other than that my system has been running spot on.