Scheduled jobs failing (again) (again 😄) (Ongoing Known Issue)

The first few we’ve spot-checked are database-related event save failures. In CoRE’s case it’s happening when it calls (either in timeHandler or recoveryHandler):

sendLocationEvent

I don’t think the failures are limited to the scheduler at this point.

It may not be limited to the scheduler part of the platform, but from the customer’s point of view it’s the same end result: a routine/smartapp that was scheduled to run, didn’t. :disappointed_relieved:

So how should people report these problems?

“Related to” was a bad choice of words - changed the wording to “not limited to scheduler”. You should still contact support when you see a failure.


@vlad get ready for a long list of stuff that happened:

This morning, at 7:00 am, my Good Morning routine was supposed to happen. My notifications show “Good morning” but none of the light changes that are supposed to take place with it. It did not set my alarm to “unarmed” and it did not change my mode to “Home” from “Night.”

at 7:35am~ish, kitchen door opens, sets off alarm since alarm still set.

at 7:40 I manually clear the alarm.

at 7:40 also, I manually click routine ā€œGood Morning.ā€ Nothing happens. No lights change. Cannot change alarm mode.

at 8:55 kitchen door intrusion detected since alarm still set. Cannot disable alarm or dismiss the alert because the page doesn’t even load on my android smart things app.

at 9:55 I get a reminder that there’s been an intrusion. Still can’t clear it, page won’t load.

at 10:55 I get a reminder that there’s been an intrusion. Still can’t clear it, page won’t load.

at 11:55 I get a reminder that there’s been an intrusion. Still can’t clear it, page won’t load.

at 12:55 I get a reminder that there’s been an intrusion. Page loads, get an error trying to clear the alarm. Trying to refresh the page to see if alarm is cleared, aaaaand page won’t load again.

At this point, it’s unreliable enough that my friends who I talked into getting smartthings are doubting the usefulness of this product.


It somewhat amazes me that we find new ways to break scheduling and that things like this aren’t monitored. It’s not always just the health of the machines/os running but the details in the DB.

I would have thought a growing number of failures would instantly set off some red flags.

At what point does this become proactive vs reactive, when people are already annoyed and complaining?


Yesterday (October 13) a SmartApp which had been running for 2 years broke down.
I tried to execute it from the IDE, but got obscure error messages (2 different).
I contacted support, with a strong suspicion of some SmartThings cloud overload, but once they saw “custom SmartApp”, they declared it was not their problem!
I posted my problem in this other thread: Cassandra timeout during read query?, since one of the 2 error codes I got was a “Cassandra timeout”.
Interestingly (?!!), up to now, this periodic SmartApp would often fail to schedule (it runs every 5 days), but when it did it executed properly. Now it schedules… and aborts!
Not sure I would label that as “progress”… :weary:

Now my system didn’t go to ‘home’ when I arrived home. Left, it went to away, and armed itself with SHM. Got back, opened garage door, boom, intrusion detected. Thanks ST… basic geolocation and firing of routines not working. Had been working basically forever…

Is this the acknowledgment of this issue?

Investigating - Some North American users may be experiencing issues with loading resources in the mobile app and web UI, arming/disarming Smart Home Monitor, and the execution of SmartApps. The engineering team is working on the issue and we will provide an update shortly.
Oct 14, 14:19 EDT

If so, can someone articulate what is causing this?

Really annoying… this morning…

  1. Mode and/or SHM status did not change, when CoRE Piston executed. Re-Ran piston several times and did not fix it.
  2. Hub displayed a notification the hub went offline, which if it did it was the ST Hub or the Cloud, not my internet. Amazingly, the app showed the hub was online (live status) and turning on a light with the app worked.
  3. Lights took a very very very long time to turn on in response to motion (smart lighting) so I manually rebooted the hub.

Sadly, this ended a good 75 day run I had with basically no issues.

Establishes again that ST can’t be relied upon for important functions. A huge reminder, it’s entirely possible the system will not inform me when an intrusion occurs even if every other required support system is up and running (power, internet, etc) - and perhaps even worse could set off a false intrusion and siren, etc and disturb or scare the family.

ELI5?
:blush:


Same @JH1 I seemingly have been very stable and when people were complaining my setup had been very reliable since finding the Cree/Osram bulb bug. Switching most of my bulbs to the Hue Bridge had made my home like dang near 100% for months now, easily 3 months. Last few days things have been super slow, aka hue delay. And then starting last night tons of failures. Routines not firing, stock basic automations not working or partially working. Yet the status page is “investigating”. I love the spin on it, don’t say we have a problem… Hell, the IDE has been unusable for me… Can’t view Hub, events, and other parts of the page just give 500s. But hey, let’s investigate.


While I am both curious and of course annoyed by additional outages, to put it in perspective, if all I endure is what happened this morning I will survive. Relatively speaking, comparing to some of the outages and degradation I have been through.

Investigating is good, acknowledging there is an issue and seeking to understand cause is good. Let’s see if discovered cause is communicated clearly and if the issue remains on status until truly resolved…

Litmus test…

@bamarayne @bridaus

Report Status!


About a week ago my mode didn’t change once… and I believe it was ST. That’s the only problem I’ve had in weeks (months maybe?). I have a thermostat that was acting wonky but I’m pretty sure it wasn’t ST because my other two were fine.


@bridaus
Try to go to IDE and click on hubs… I get:
Oh No! Something Went Wrong!
Error
500: Internal Server Error

I also have issues going to a Thing in the mobile app and viewing the device logs in the Recently tab… takes forever, fails, etc…

Check it out, let us know results…

Update - While the connectivity and SmartApp execution issues have improved, we are continuing to investigate additional performance improvements.
Oct 14, 22:30 EDT

Okay, can someone share the information that backs up this statement?

Still seems like a mess to me; now I am losing the state of my pistons. Having to rebuild them all as my "if"s are magically empty.


I am beginning to think the status update can be translated to:

‘Curious issue. We’re tired, going to bed. Catch up Monday’

Yea - one of the issues was “fixed” by replacing 3 nodes in our events cluster for na01. These are read timeouts from that cluster. Reducing this should have helped with the UI, IDE, smartapp executions, etc…

The other issue that we’re seeing is timeouts for saving events:

It doesn’t look like the replacement fixed the issue here, as the timeouts are still elevated and follow a cyclical pattern (we can see other Cassandra metrics trending upwards again). This mainly affects execution, and the chance of it occurring increases in proportion to the rate of created events per app execution. So while many smartapps/devices are firing fine, certain ones are hit more often (the ones that create more events). We have a change that can go out to make the event creation async, which would help with executions, but that could have a number of unintended side effects and would change the behavior of the core part of the system. It would be much safer to figure out what happened in the last couple of days to cause these spikes.
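To put rough numbers on why apps that create more events get hit more often, here is a minimal sketch. It assumes every event save times out independently with the same probability and that a single timed-out save aborts the whole execution; neither assumption is confirmed for the platform, and the function name and the sample probability are made up for illustration.

```python
def execution_failure_probability(p_timeout: float, events_per_execution: int) -> float:
    """Chance that at least one of n synchronous event saves times out.

    Assumes independent saves with identical timeout probability
    (an illustrative simplification, not a measured property of the cluster).
    """
    if not 0.0 <= p_timeout <= 1.0:
        raise ValueError("p_timeout must be between 0 and 1")
    if events_per_execution < 0:
        raise ValueError("events_per_execution must be non-negative")
    # The execution succeeds only if every save succeeds.
    return 1.0 - (1.0 - p_timeout) ** events_per_execution


if __name__ == "__main__":
    # Hypothetical 1% per-save timeout rate, purely for illustration.
    for n in (1, 5, 20):
        print(n, execution_failure_probability(0.01, n))
```

For small timeout rates the result is close to `n * p`, which matches the "in proportion to the rate of created events" description: under these assumptions an app that saves 20 events per run fails far more often than one that saves a single event.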


I haven’t seen a lot of problems.

The app had been slow. I’ve gotten a few timeouts unable to save pages. Other than that the app had been ok.

The IDE had been just fine. I’ve been in it literally all day and night working on a project. No issues there at all.

My system routine to change the mode to night mode at 2030 tonight failed to change the mode. My mode manager piston detected it and corrected the mode within a couple of minutes. I only knew it happened because the mode manager sends me a message.

Other than that my system had been running spot on.
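For anyone wondering what a mode manager like that amounts to, here is a rough sketch of the idea in Python. It is not CoRE or SmartThings code: the schedule, the function names, and the `set_mode`/`notify` callbacks are all hypothetical stand-ins for whatever the hub actually exposes.

```python
from datetime import time

# Hypothetical schedule: (time the mode should start, mode name).
MODE_SCHEDULE = [(time(7, 0), "Home"), (time(20, 30), "Night")]


def expected_mode(now: time, schedule=MODE_SCHEDULE) -> str:
    """Return the mode whose start time has most recently passed."""
    ordered = sorted(schedule)
    # Before the first entry of the day, the last mode from yesterday still applies.
    current = ordered[-1][1]
    for start, mode in ordered:
        if now >= start:
            current = mode
    return current


def reconcile(now: time, actual_mode: str, set_mode, notify) -> bool:
    """Correct the mode if a scheduled change was missed; return True if corrected."""
    wanted = expected_mode(now)
    if actual_mode == wanted:
        return False
    set_mode(wanted)
    notify(f"Mode was {actual_mode}, corrected to {wanted}")
    return True


if __name__ == "__main__":
    # Simulate the missed 20:30 change being caught a couple of minutes later.
    reconcile(time(20, 32), "Home", set_mode=print, notify=print)
```

Run on a short interval, a check like this only papers over a missed routine; it relies on the checker itself still being scheduled, so it helps with partial failures rather than a full outage.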


My widgets that run some of my routines have disappeared from my iPhone as well. I had to set them up again.

This is the issue I’ve seen the last few days. So far, automations are fine, things are just sluggish updating on the mobile app.

Still happening this morning…


It’s probably your project that is causing the issue, see Vlad’s charts.

Any correlation to Project Frankenstein?
