Scheduled jobs failing (again) (again 😥) (Ongoing Known Issue)

yea, it’s so smart it knows the previous light on/off patterns from when he is home and it emulates them when he is not…

Or that’s a ridiculous exaggeration of the facts.

Right up there with the “Weekly” updates…

@alex The smartest home in America runs on Control4 or another quality HA, not ST…

My few things appear to be working tonight

So if these scheduling clusters were discovered to be crashing yesterday… why was the status not put up on that until today…

??

There…

SmartApp executions have improved as we have implemented short-term mitigations. We continue to actively monitor performance while investigating the root cause of the degradation.
Sep 30, 20:27 EDT

Unfortunately I don’t have much in terms of an update, but figured I’d chime in again since it’s been a while since the last post.

We are still performing rolling restarts on the scheduler cluster as bad nodes trigger alerts. Schedules have more or less stabilized (at the expense of our ops team’s weekend). There’s a very small chance of a scheduled execution being missed when a restart is performed but since the restarts are staggered the impact should be minimal.
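For anyone curious what "staggered" means in practice, here is a rough sketch of the idea, not the actual ops tooling. The function name `staggered_restart`, the `is_healthy` and `restart` callbacks, and the `min_healthy` threshold are all hypothetical and only there for illustration. It also assumes `restart` blocks until the node is back up.

```python
from typing import Callable, Iterable, List


def staggered_restart(
    flagged: Iterable[str],
    all_nodes: Iterable[str],
    is_healthy: Callable[[str], bool],
    restart: Callable[[str], None],
    min_healthy: int,
) -> List[str]:
    """Restart flagged nodes one at a time (hypothetical helper).

    A node is only restarted while at least `min_healthy` *other* nodes
    report healthy, so the cluster keeps executing jobs during the restart.
    Returns the nodes that were actually restarted, in order.
    """
    cluster = list(all_nodes)
    restarted: List[str] = []
    for node in flagged:
        # Count healthy nodes other than the one about to go down.
        healthy_others = sum(1 for n in cluster if n != node and is_healthy(n))
        if healthy_others < min_healthy:
            # Not enough capacity left; leave this node for a later pass.
            continue
        restart(node)  # assumed to block until the node is serving again
        restarted.append(node)
    return restarted
```

Because only one node is down at any moment, a job is at risk only if it was handed to that node right as it restarted, which is why the chance of a missed execution is small rather than zero.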

On the engineering side we’ve spun up canary nodes with updated JVM settings to see if we can stabilize the servers without having to perform restarts, but stabilizing is still just a short-term fix until we understand what is filling memory on the server in this way, and how.

Now to answer your question… Servers crashing shouldn’t cause an outage as redundancy is built into the platform. While we’re at our desks or at home actively working on the problem we can stay ahead of crashes and the impact on users is minimal as there is a cluster of servers available to handle the requests - which is why we didn’t think it warranted a status page update. What happened overnight and into this morning was an outage because the monitoring that was put in place wasn’t effective and alerts weren’t sent out while people were sleeping and not yet in the office.
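One common guard against alerts silently not firing is a heartbeat (or "dead man's switch") check: each monitor reports in on a timer, and something independent complains when a monitor goes quiet. The sketch below only illustrates that general technique. It is an assumption for the sake of example, not a description of what is actually deployed, and the name `stale_monitors` and the monitor names in the usage note are made up.

```python
from typing import Dict, List


def stale_monitors(
    last_seen: Dict[str, float], now: float, max_silence: float
) -> List[str]:
    """Return monitors that have not reported within `max_silence` seconds.

    `last_seen` maps a monitor name to the timestamp (in seconds) of its
    most recent heartbeat. Names are returned sorted for stable output.
    """
    return sorted(
        name for name, ts in last_seen.items() if now - ts > max_silence
    )
```

For example, calling `stale_monitors({"scheduler-alerts": 100.0, "ticker-alerts": 950.0}, now=1000.0, max_silence=300.0)` flags only `scheduler-alerts`, since it has been silent for 900 seconds.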

To be completely honest if this was January there probably wouldn’t have been a status update in the first place and if there was it would have been cleared by the time we got rolling restarts in place. Recently we have tried to be more transparent about issues with the platform which is why the status has not yet been cleared - yes schedules are more or less stable right now but the platform is not yet healthy.

edit: Don’t mean to get too corny here but wanted to say thanks to the community devs in slack who chimed in today to provide some data points, insight into memory issues and support. You know who you are :smiley:

Take this, 'scheduler cluster', who needs you anyway? Thanks @joshua_lyon

Hey now, don’t blame this on ticker! Ticker is pretty amazing and afaik there hasn’t been a problem with it since I joined (early April). The issue is with the cluster that executes the SmartApps… internally we call it the scheduler cluster because it executes jobs that ticker tells it to execute. That is what is crashing - ticker is chugging along fine.

Correction made :smile:

Have had a hunch it wasn’t ticker, specifically.

To wit: this afternoon I finished integration of our smoke alarms into ST (a modified contact sensor working from a Kidde SM120X Relay/Power Supply Module). During the course of testing, I set off the smoke alarms maybe a dozen times & observed the ‘smoke’ device toggle between ‘detected’ and ‘clear.’ Cool.

Next was to set up SHM for this site (previously I used it only for watching door sensors on our second home). This instance was set up to 1) send a notification to me and 2) turn on some lights.

Mostly, it worked. But, one time only two of the three specified lights came on.

Anyone want coffee? Long weekend ahead.

Another update:
We have made some changes to the JVM GC that help limit off-heap memory and reduce GC contention under certain types of load. We have now been running scheduler nodes without a similar crash for about 24 hours (they were being restarted every 2 hours when I last posted an update). We should be updating our status page back down to monitoring sometime soon, and if we continue to run smoothly for a while longer we’ll drop the status to operational again.

What is this and how do I get it lol

From left to right, first 2 rows are Sharp Tools (Android). Third Hue widgets. Fourth Do buttons by IFTTT. If you are on Android, look up Sharp Tools on Google Play…it’s a community smart app developed by @joshua_lyon

Thank you, will do

Status has been updated to Resolved.

Thanks @vlad. I updated the thread title. If any community members still have issues, please update it back to Ongoing.

Seems like for the last two days I’ve still had a few failed schedules.

I’m struggling to remember, how do you get to the scheduled jobs screen?

In the IDE, go to your location, then SmartApps, then click on the app that is scheduled.

I had a couple fail tonight.
