Scheduler and Polling quit after some minutes, hours, or days

I have a SmartApp that brightens lights in a zone when there is motion or a door opens. It dims the lights after a given timeout period.

I have 6 instances of the app on one hub (6 different lighting zones) and 8 on another hub in another building.
When the app initializes, it starts the scheduler to run a scheduleCheck() function every minute, which turns the lights off when appropriate.
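
A minimal sketch of that kind of setup, assuming hypothetical preference inputs named motionSensors, contactSensors, and switches, plus timeoutMinutes and dimLevel settings (none of these names are from the original post):

```groovy
def initialize() {
    subscribe(motionSensors, "motion.active", motionHandler)
    subscribe(contactSensors, "contact.open", contactHandler)
    // Ask the platform to call scheduleCheck() roughly once a minute.
    runEvery1Minute(scheduleCheck)
}

def scheduleCheck() {
    // Dim the zone once the timeout has elapsed since the last activity.
    def elapsed = now() - (state.lastActivity ?: now())
    if (elapsed > timeoutMinutes * 60 * 1000) {
        switches.setLevel(dimLevel)
    }
}
```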

It was working for months. Then it stopped dimming reliably.
I have determined that the scheduleCheck() function stops running as scheduled.

I have also noticed that it now usually schedules at the top of the minute, i.e., the seconds are zero. That was not previously the case. And the time between scheduled calls can be 1, 2, or 3 minutes.

I have just installed a workaround, in which I re-initialize the scheduler every time motion starts or stops, or a door opens or closes. I think that the worst case will be if the scheduler quits before the lights are dimmed. Then they will stay bright until the next time motion is detected, which would be better than staying bright until I manually reset the app.
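
In SmartApp terms, the workaround amounts to something like this (handler and setting names here are mine, not the poster's exact code):

```groovy
// Every motion/contact event tears down and re-creates the schedule, so a
// dead scheduler survives at most until the next event.
def motionHandler(evt) {
    state.lastActivity = now()
    unschedule(scheduleCheck)       // drop any stale schedule
    runEvery1Minute(scheduleCheck)  // and re-create it
    switches.setLevel(brightLevel)  // brightLevel is a hypothetical setting
}
```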


I’ve been having a similar problem with logging to Grovestreams (based on Jason Steele’s code).
My SmartApp had been running with very few problems for a couple of months, but now the scheduler just seems to quit. Watching live logging, the event subscription is still working, as temperatures are appended to the queue, but the post to Grovestreams that is run by the scheduler just stops.

I have to reinitialize the smart app to get it working again.
Over the past few days, it’s stayed running for less than 24 hours at a time.
Initially, the scheduler ran every minute. I changed it to every 2 minutes yesterday hoping that would help, but so far no luck; it still quit after about 8 hours.

It was running at :00 seconds. I’ve just changed it again to run at :40 seconds every two minutes to see if that helps. Maybe it’s contention with too many other schedulers running at the same time?
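
For reference, that :40-second offset can be expressed with a Quartz cron expression passed to schedule(); processQueue here stands in for whatever method actually posts to Grovestreams:

```groovy
// Fields are: seconds minutes hours day-of-month month day-of-week.
// "40 0/2 * * * ?" fires at :40 seconds past every even minute.
schedule("40 0/2 * * * ?", processQueue)
```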

I’ve been seeing this recently as well. The ticket I’ve had open for like 4 months now is still being looked at. Karl said he’d get back to me on Tuesday.

Has this workaround been working? I was under the impression that when the scheduled jobs stopped it was because the entire SmartApp had been killed.

Yes, it’s been working to restart the scheduler every time the motion handler is called.

I’ve noticed as much as 7 minutes between calls to scheduleCheck(), which is supposed to be run every minute.

This just happened to me overnight for 4 different apps across 2 locations, all of which use schedule() or runEveryXXMinutes(). The scheduled routine just stops being called, although in one app a separate scheduled routine (once daily) ran as scheduled, long after the other (more frequent) routine stopped being called.

This spells #FAIL for devices that have to be polled regularly and frequently to get meaningful status updates (like weather stations, thermostats, garage door openers, and the like).

I’m having this same problem with the MyQ garage app. It relies on API calls out to Liftmaster to get the door status. It has a scheduler to poll the API to keep the status refreshed, but for whatever reason the scheduler has been silently dying at random. Sometimes it will go a few days; other times it will only last a few hours. I’ve tried switching to a chained runIn() call once per minute, which reinitializes the schedule on each run (see the sketch below), but even that died after a while. I’m at a loss right now. Because it dies so randomly, I can’t even debug it.
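
The chained-runIn pattern looks roughly like this (poll() is a stand-in for the method that hits the Liftmaster API):

```groovy
def pollChain() {
    // Schedule the next link first, so a slow or failed poll() can't
    // prevent the chain from continuing.
    runIn(60, pollChain)
    poll()
}
```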

btk wrote, “I was under the impression that when the scheduled jobs stopped it was because the entire SmartApp had been killed.”

Not in my experience. I see in the log that when the scheduleCheck() function stops running every minute as scheduled, the motion handler still works.

I finally got an error on my logging to Grovestreams after the scheduler stopped.
Sensor readings were still being appended to the queue, but the ProcessQueue job had stopped.

Here is what the logs showed:

c155af1c-XXXX-XXXX-b406-b346f1934308 12:29:16 PM: debug Appending to queue [compId:Office Sensor, streamId:temperature, data:77.0, time:1432571356857]
c155af1c-XXXX-XXXX-b406-b346f1934308 12:29:03 PM: error com.netflix.hystrix.exception.HystrixRuntimeException: C*-IsaState-Update timed-out and fallback disabled.
c155af1c-XXXX-XXXX-b406-b346f1934308 12:28:51 PM: debug Appending to queue [compId:Office Sensor, streamId:temperature, data:77.0, time:1432571331738]
c155af1c-XXXX-XXXX-b406-b346f1934308 12:28:50 PM: error com.netflix.hystrix.exception.HystrixRuntimeException: C*-IsaState-Update timed-out and fallback disabled.
c155af1c-XXXX-XXXX-b406-b346f1934308 12:28:38 PM: debug Appending to queue [compId:Office Sensor, streamId:temperature, data:77.0, time:1432571318090]
c155af1c-XXXX-XXXX-b406-b346f1934308 12:26:31 PM: debug Appending to queue [compId:Den Sensor, streamId:temperature, data:70.6, time:1432571191318]

Going through the mobile app to uncheck and recheck one of the devices reset everything, and it started logging again.
So this time it ran for about 16 hours before the issue occurred.

Every response I’ve gotten from support on this for the last several months is “Yeah, we had a problem but it’s fixed now…”, so I’ve rigged up what I consider a pretty good workaround.

I’ve added two web endpoints to each of my smartapps whose schedulers are important to me.

The first checks the time the last scheduled job ran (updated in state at each run) against the current time. If it’s been long enough to declare the scheduler dead, it returns “FAIL”; otherwise it returns “FIRING” along with the name of the app and the last time the scheduler ran.

The second simply calls my method that creates the schedules.
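
A rough sketch of those two endpoints, assuming the worker job stamps state.lastRun on every run (the paths, the 5-minute threshold, and the method names are mine, not the poster's):

```groovy
mappings {
    path("/stamp")      { action: [GET: "checkStamp"] }
    path("/reschedule") { action: [GET: "restartSchedules"] }
}

def checkStamp() {
    // Declare the scheduler dead if the worker hasn't stamped in 5 minutes.
    def dead = now() - (state.lastRun ?: 0) > 5 * 60 * 1000
    def body = dead ? "FAIL" : "FIRING ${app.label} last ran ${new Date(state.lastRun)}"
    render contentType: "text/plain", data: body
}

def restartSchedules() {
    unschedule()   // clear everything, then rebuild from scratch
    initialize()
    render contentType: "text/plain", data: "RESCHEDULED"
}
```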

This alone makes it easy enough to bookmark the URLs (with all of the auth tokens), so it’s one click to check on the scheduler and another click to restart it via my browser. But we’re talking about home automation, and that’s not very automatic.

This is the part of the post where I go overboard, so feel free to tune out at this point. =)

I’m a network engineer by trade, and that involves monitoring things. My current free weapon of choice is Icinga. Using this, I can have it watch each of my “stamp” URLs. If it sees “FIRING” on the page, it does nothing. If it sees “FAIL”, it hits the reschedule URL. Automation!

Now I just need something to monitor my monitoring… :stuck_out_tongue_winking_eye:


This…is fantastic. I was actually planning on writing a very similar setup to monitor my app’s problematic cron job. Mostly I was hoping it might shed some light on what exactly happens when the scheduler dies. It just feels wrong because, as you say, it’s hardly automation.

I hadn’t heard of Icinga. I have a probe set up at montastic that watches for a URL on my network, so that’s been good enough for me just to make sure a particular service is running at home.


@btk
Do you think it would be OK to reschedule the jobs periodically?

I’m experimenting with something like that. In my SmartApp, I created a second scheduled job whose sole purpose is to make sure the first is running and, if it has died, to recreate it. The second “monitoring” job checks the last poll timestamp every 5 minutes, and if enough time has passed to confirm the first scheduler is dead, it brings it back to life (see the sketch below). It’s working well so far: it has successfully restarted things for me about 3 times in the last 24 hours. My hope is that because the monitoring job’s task is so light, it’s less likely to succumb to SSDS (Sudden Scheduler Death Syndrome).
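
A sketch of that watchdog arrangement (the 5-minute check interval is from the post; poll(), watchdog(), and the 10-minute dead threshold are my own assumptions):

```groovy
def initialize() {
    runEvery1Minute(poll)       // the real worker
    runEvery5Minutes(watchdog)  // the lightweight monitor
}

def poll() {
    state.lastPoll = now()  // heartbeat the watchdog checks
    // ... the actual polling work ...
}

def watchdog() {
    // If the worker hasn't stamped in over 10 minutes, assume its schedule
    // died and re-create it. The watchdog itself does almost no work, so it
    // should stay well under any execution-time limit.
    if (now() - (state.lastPoll ?: now()) > 10 * 60 * 1000) {
        unschedule(poll)
        runEvery1Minute(poll)
    }
}
```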

Is this the official term?

Still unofficial, but I’m working on it.


Maybe. I didn’t do it that way because to do it cleanly you need to unschedule and then reschedule everything. Doing it blindly every so often would cause the next scheduled run to be deleted and effectively pushed back by the reschedule.

Still no useful word from support, incidentally. @Tyler escalated my ticket to Mager a while back, who then closed it, saying that they had “platform problems” one night. Now I’ve got a new ticket open with L1, and the responses I’m told to expect just never come. When I ask what’s up, they say they need a few more days. Frustrating to no end.


And what happens when your second scheduled job dies too? ST has been having problems with their scheduler from day one, and despite repeated assurances that “this time it’s really fixed”, it still doesn’t work. It’s pathetic.


I get the feeling they’re just waiting for “Hub v2” and hoping that it relieves the pressure on the cloud scheduler.

Yes, that’s what I’m afraid of. So far, that hasn’t happened. I’m hoping the second one is less likely to die because it does so little in terms of activity. I read somewhere on here that schedulers die because they sometimes hit a 20-second running limit while doing their job, sometimes due to the ST system just being slow. I figured if the monitoring scheduler does as little as possible, it won’t hit that limit and won’t just die.