Unfortunately I don't have much of an update, but I figured I'd chime in again since it's been a while since the last post.
We are still performing rolling restarts on the scheduler cluster as bad nodes trigger alerts. Schedules have more or less stabilized (at the expense of our ops team's weekend). There's a very small chance of a scheduled execution being missed when a restart is performed, but since the restarts are staggered the impact should be minimal.
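For anyone wondering what "staggered" means in practice, here's a rough sketch of the restart loop; the node names and helper methods are placeholders standing in for our actual tooling, not real code from the platform:

```java
import java.util.List;

// Illustrative only: restart scheduler nodes one at a time, waiting for
// each node to report healthy before moving on, so at most one node's
// schedules are at risk at any given moment.
public class RollingRestart {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical node names for the sketch.
        List<String> nodes = List.of("sched-01", "sched-02", "sched-03");
        for (String node : nodes) {
            drain(node);    // stop routing new schedule executions to this node
            restart(node);  // restart the scheduler process
            while (!isHealthy(node)) {
                Thread.sleep(5_000); // wait before re-checking health
            }
        }
    }

    // Placeholder helpers: in reality these steps go through whatever
    // orchestration/automation actually manages the cluster.
    static void drain(String node) { System.out.println("draining " + node); }
    static void restart(String node) { System.out.println("restarting " + node); }
    static boolean isHealthy(String node) { return true; }
}
```

The point is that only one node is out of rotation at a time, so at worst a schedule that would have landed on that node during its restart window is at risk.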
On the engineering side, we've spun up canary nodes with updated JVM settings to see if we can stabilize the servers without having to perform restarts. Even if that works, it's still just a short-term fix until we understand what is filling memory on these servers and how.
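For the curious, the first step in answering that question is simply watching which memory pool is actually growing over time. A minimal sketch of that kind of instrumentation using the standard JVM management API (illustrative only, not our production code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Periodically log per-pool memory usage so we can see which pool
// (old gen, metaspace, code cache, etc.) is the one that keeps filling.
public class MemoryPoolLogger {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage usage = pool.getUsage();
                System.out.printf("%-25s used=%,d max=%,d%n",
                        pool.getName(), usage.getUsed(), usage.getMax());
            }
            Thread.sleep(60_000); // sample once a minute
        }
    }
}
```

If every heap pool looks flat while the overall process footprint keeps climbing, the growth is off-heap, which changes where we look next.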
Now to answer your question: servers crashing shouldn't cause an outage, because redundancy is built into the platform. While we're at our desks or at home actively working on the problem, we can stay ahead of crashes and the impact on users is minimal, since there is a cluster of servers available to handle the requests; that's why we didn't think it warranted a status page update. What happened overnight and into this morning was an outage because the monitoring we had put in place wasn't effective and alerts weren't sent out while people were asleep and not yet in the office.
To be completely honest, if this were January there probably wouldn't have been a status update in the first place, and if there had been, it would have been cleared by the time we got rolling restarts in place. Recently we have tried to be more transparent about issues with the platform, which is why the status has not yet been cleared: yes, schedules are more or less stable right now, but the platform is not yet healthy.
edit: Don't mean to get too corny here, but I wanted to say thanks to the community devs in Slack who chimed in today to provide data points, insight into memory issues, and support. You know who you are.