Scheduled jobs failing (again) (again 😥) (Ongoing Known Issue)


#34

Had a routine fail to work this morning also.


(Gus) #35

The Good Night routine did not execute the past two nights, and this morning Good Morning did not execute either. Good Night is time driven; Good Morning is sunrise driven. I changed the time for Good Night, but that did not resolve it. I don't have the patience or stamina to deal with the busy work that Support requires, only to hear at the end "too bad, deal with it, we have more important issues". Of course, they say it much nicer, but that's the gist of it, so I give up and just hope it gets fixed.


#36

There are two topics on this subject now.

I had a routine and a piston fail so far today.


(Marc) #37

I am hoping to start a trend. I edited the title of this thread to [KNOWN ISSUE]. When it's fixed, someone should change it to [RESOLVED]


(Stacy Butera) #38

Thanks for the update regarding this situation. Hope it gets sorted quickly.


(John C) #39

From where I sit, things didn't "just break." I have the sense scheduled stuff began failing more frequently over the last week or two, and this is on my two geographically diverse hubs. These are the same issues I've had for many months: a twice-per-day fan automation at my second home and smart lighting automations here at the primary residence, both randomly failing. Support and @Aaron have looked at them on numerous occasions, with no definitive improvement. Gotta give them both kudos for corresponding with me, and I hope I was able to contribute something meaningful to their troubleshooting efforts...

@vlad kindly chimed in, above, and his response serves to calm some of us "propeller heads" who try to guess what is going on. Thanks, Vlad. The information you shared is appreciated.

As one of those old Grey Beards who deployed a lot of embedded software over the years for a number of mission-critical applications, both military and commercial, I have only the following to offer:

1) There is a major problem in the ST architecture: cloud, primarily, but also the hub firmware. I am hopeful the reason we're not seeing all kinds of whiz-bang features being added and support for every device y'all are clamoring for is that the bulk of the engineering team is feverishly working on "fixing" the underpinnings of this product to bring about dependable, reliable operation. There were (probably) a bunch of very painful lessons learned about reliability, scalability, and defensive programming we will never hear of. There is nothing official I've heard to support this, just my hunch.

2) There is a cultural shift going on at ST. Not unusual following an acquisition. New management, new & departing employees, and new objectives from the Mother Ship. The sudden absence of @alex from the forums, along with his promised weekly reports, has gone on so long that I'm convinced he is "on special assignment" or "spending more time with family" or something like that. Samsung is probably just not ready to announce it yet.

3) This stuff is still cool. I'm anxious to get on with further additions to my two locations! :grin:


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy; NOT a SmartThings Employee.) #40

Well… At least he has “The Smartest Home in America”:

( http://video-api.wsj.com/api-video/player/iframe.html?guid=7A2BC378-F7BC-4A42-BCBB-1C830F82174B )


(Never Trust @bamarayne) #41

yea, it's so smart it knows the previous light on/off patterns from when he is home and it emulates them when he is not....

Or that's a ridiculous exaggeration of the facts.

Right up there with the "Weekly" updates....

@alex The smartest home in America runs on Control4 or another quality HA, not ST...


(Chris ) #42

My few things appear to be working tonight


(Never Trust @bamarayne) #43

So if these scheduling clusters were discovered to be crashing yesterday... why was the status not put up on that until today...

??


(Bobby) #44

There...

SmartApp executions have improved as we have implemented short-term mitigations. We continue to actively monitor performance while investigating the root cause of the degradation.
Sep 30, 20:27 EDT


(vlad) #45

Unfortunately I don't have much in terms of an update, but I figured I'd chime in again since it's been a while since the last post.

We are still performing rolling restarts on the scheduler cluster as bad nodes trigger alerts. Schedules have more or less stabilized (at the expense of our ops team's weekend). There's a very small chance of a scheduled execution being missed when a restart is performed but since the restarts are staggered the impact should be minimal.
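
For anyone trying to picture why staggering matters, here is a minimal sketch. It is not ST's tooling; the node names, the `batch_size` parameter, and both helper functions are made up for illustration. It only shows that restarting nodes in small batches keeps most of the cluster serving at any moment.

```python
from typing import List


def rolling_restart_plan(nodes: List[str], batch_size: int = 1) -> List[List[str]]:
    """Split nodes into consecutive batches; only one batch is down at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def min_available(nodes: List[str], batch_size: int = 1) -> int:
    """Smallest number of nodes still serving while any one batch restarts."""
    plan = rolling_restart_plan(nodes, batch_size)
    return min((len(nodes) - len(batch) for batch in plan), default=0)


# Hypothetical node names, for illustration only.
nodes = ["sched-1", "sched-2", "sched-3", "sched-4"]
plan = rolling_restart_plan(nodes, batch_size=1)
for step, batch in enumerate(plan, start=1):
    print(f"step {step}: restart {batch}, {len(nodes) - len(batch)} nodes still serving")
```

A job that happens to be running on the node being restarted can still be missed, which matches the "very small chance" described above.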

On the engineering side, we've spun up canary nodes with updated JVM settings to see if we can stabilize the servers without having to perform restarts, but stabilizing is still just a short-term fix until we understand what is filling memory on the servers and how.

Now to answer your question... Servers crashing shouldn't cause an outage as redundancy is built into the platform. While we're at our desks or at home actively working on the problem we can stay ahead of crashes and the impact on users is minimal as there is a cluster of servers available to handle the requests - which is why we didn't think it warranted a status page update. What happened overnight and into this morning was an outage because the monitoring that was put in place wasn't effective and alerts weren't sent out while people were sleeping and not yet in the office.

To be completely honest if this was January there probably wouldn't have been a status update in the first place and if there was it would have been cleared by the time we got rolling restarts in place. Recently we have tried to be more transparent about issues with the platform which is why the status has not yet been cleared - yes schedules are more or less stable right now but the platform is not yet healthy.

edit: Don't mean to get too corny here but wanted to say thanks to the community devs in slack who chimed in today to provide some data points, insight into memory issues and support. You know who you are :smiley:


(Bobby) #46

Take this, 'scheduler cluster', who needs you anyway? Thanks @joshua_lyon


(vlad) #47

Hey now, don't blame this on ticker! Ticker is pretty amazing and afaik there hasn't been a problem with it since I joined (early April). The issue is with the cluster that executes the SmartApps... internally we call it the scheduler cluster because it executes jobs that ticker tells it to execute. That is what is crashing - ticker is chugging along fine.
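
A rough sketch of that split, for illustration only. The `Ticker` and `ExecutorCluster` classes below are hypothetical stand-ins, not the platform's actual code: one piece decides when a job is due, the other actually runs it, so the second can fail while the first keeps working.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class Ticker:
    """Tracks when jobs are due; it never runs them itself."""

    def __init__(self) -> None:
        self._queue: List[Tuple[int, str]] = []

    def schedule(self, due_at: int, job_id: str) -> None:
        heapq.heappush(self._queue, (due_at, job_id))

    def due(self, now: int) -> List[str]:
        """Pop and return every job whose due time has arrived."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            ready.append(heapq.heappop(self._queue)[1])
        return ready


class ExecutorCluster:
    """Runs a job on the first healthy node; fails only if every node is down."""

    def __init__(self, healthy: Dict[str, bool]) -> None:
        self.healthy = healthy

    def execute(self, job_id: str, run: Callable[[str, str], None]) -> bool:
        for node, ok in self.healthy.items():
            if ok:
                run(job_id, node)
                return True
        return False  # the ticker did its part, but nothing could run the job


ticker = Ticker()
ticker.schedule(10, "good-night")  # hypothetical job ids and times
cluster = ExecutorCluster({"node-a": False, "node-b": True})
for job in ticker.due(now=10):
    cluster.execute(job, lambda j, n: print(f"{j} ran on {n}"))
```

In this toy model a missed routine means `execute` returned `False`, even though `due` reported the job on time.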


(Bobby) #48

Correction made :smile:


(John C) #49

Have had a hunch it wasn't ticker, specifically.

To wit: this afternoon I finished integration of our smoke alarms into ST (a modified contact sensor working from a Kidde SM120X Relay/Power Supply Module). During the course of testing, I set off the smoke alarms maybe a dozen times & observed the 'smoke' device toggle between 'detected' and 'clear.' Cool.

Next was to set up SHM for this site (previously I used it only for watching door sensors on our second home). This instance was set up to 1) send a notification to me and 2) turn on some lights.

Mostly, it worked. But, one time only two of the three specified lights came on.


( I ❤︎ ST (Star Trek??!)) #50

Anyone want coffee? Long weekend ahead.


(vlad) #51

Another update:
We have made some changes to the JVM GC that help limit off-heap memory and reduce GC contention under certain types of load. We have now been running scheduler nodes without a similar crash for about 24 hours (they were being restarted every 2 hours when I last posted an update). We should be updating our status page back down to monitoring sometime soon, and if we continue to run smoothly for a while longer we'll drop the status to operational again.


(Zachary Evans) #52

What is this and how do I get it lol


(Bobby) #53

From left to right, the first 2 rows are Sharp Tools (Android). The third is Hue widgets. The fourth is Do buttons by IFTTT. If you are on Android, look up Sharp Tools on Google Play...it's a community smart app developed by @joshua_lyon....