Scheduled jobs failing (again) (again 😥) (Ongoing Known Issue)

A rundown of what is going on -

Yesterday around 12:00 CDT we started seeing crashes in the Scheduler cluster. They're happening on a fairly regular period of roughly every 2 hours on the scheduler. The errors look like the servers are running out of memory: the JVMs on these servers are tuned (as standard) to leave additional memory free on the box, but something is causing the JVM to try to allocate more memory when none is available on the system, which should never happen. The current theory is that this is off-heap memory. The result is a fatal crash of the JVM with a malloc error.
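For anyone wondering what "off heap" means in practice: the heap ceiling we tune with -Xmx doesn't cover everything the process allocates. Here's a minimal sketch (purely illustrative, not our code; the class name and sizes are made up) showing how off-heap allocation grows independently of the heap limit:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class OffHeapGrowth {
    public static void main(String[] args) {
        // Direct buffers are allocated outside the Java heap, so the usual
        // -Xmx ceiling doesn't bound them. They're capped separately by
        // -XX:MaxDirectMemorySize (which defaults to roughly the heap size),
        // and native allocations made by JNI libraries aren't capped by the
        // JVM at all - if the OS has nothing left to hand out, the JVM's own
        // internal malloc calls can fail, which is the fatal-crash pattern
        // described above.
        List<ByteBuffer> held = new ArrayList<>();
        long allocatedMb = 0;
        try {
            while (true) {
                held.add(ByteBuffer.allocateDirect(64 * 1024 * 1024)); // 64 MB off-heap per loop
                allocatedMb += 64;
            }
        } catch (OutOfMemoryError e) {
            // With default settings this surfaces as "Direct buffer memory";
            // the point is that the Java heap itself is barely used while
            // native memory keeps climbing.
            System.err.println("Off-heap allocation failed after ~" + allocatedMb + " MB: " + e);
        }
    }
}
```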

We haven’t encountered these types of errors before and had to create some new monitoring rules so we can react before the servers crash. Those rules weren’t configured correctly, which left the scheduler cluster severely under-provisioned from 01:00 to 06:00 CDT - which is why so many scheduled executions failed. (Yes, we have other metrics available that would have signaled a problem, but we don’t normally use them for alerts.)

I’ll provide an update when I get more info - the dumps we’re looking at are ~15 GB each, so identifying the root cause will take some time. Schedules should be performing better than they did overnight, but expect some rockiness as servers are constantly being replaced to keep up with the crashes. I’ll also ask to get the status page updated for last night and for the continued degraded performance until we figure out what is causing these crashes.
