Scheduled jobs failing (again) (again 😥) (Ongoing Known Issue)

A rundown of what is going on -

Yesterday around 12:00 CDT we started seeing crashes in the Scheduler cluster. They're happening on a fairly regular period of roughly every 2 hours on the scheduler. The errors look like the servers are running out of memory: the JVMs on these servers are tuned (as standard) to leave additional memory free on the box, but something is causing the JVM to try to allocate more memory when none is available on the system, which should never happen. The current theory is that this is off-heap memory. The result is a fatal crash of the JVM with a malloc error.
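For anyone wondering what "off heap" means in practice: the heap ceiling we tune with -Xmx doesn't cover everything the process allocates. Here's a minimal sketch (purely illustrative, not our code; the class name and sizes are made up) showing how off-heap allocation grows independently of the heap limit:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class OffHeapGrowth {
    public static void main(String[] args) {
        // Direct buffers are allocated outside the Java heap, so the usual
        // -Xmx ceiling doesn't bound them. They're capped separately by
        // -XX:MaxDirectMemorySize (which defaults to roughly the heap size),
        // and native allocations made by JNI libraries aren't capped by the
        // JVM at all - if the OS has nothing left to hand out, the JVM's own
        // internal malloc calls can fail, which is the fatal-crash pattern
        // described above.
        List<ByteBuffer> held = new ArrayList<>();
        long allocatedMb = 0;
        try {
            while (true) {
                held.add(ByteBuffer.allocateDirect(64 * 1024 * 1024)); // 64 MB off-heap per loop
                allocatedMb += 64;
            }
        } catch (OutOfMemoryError e) {
            // With default settings this surfaces as "Direct buffer memory";
            // the point is that the Java heap itself is barely used while
            // native memory keeps climbing.
            System.err.println("Off-heap allocation failed after ~" + allocatedMb + " MB: " + e);
        }
    }
}
```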

We haven’t encountered these types of errors before and had to create some new monitoring rules so we can react before the servers crash. Those rules weren’t configured correctly, which left the scheduler cluster severely under-provisioned from 01:00 to 06:00 CDT - which is why so many scheduled executions failed. (Yes, we have other metrics available that would have signaled a problem, but we don’t normally use them for alerts.)

I’ll provide an update when I get more info - the dumps we’re looking at are ~15 GB each, so identifying the root cause will take some time. Schedules should be performing better than they did overnight, but expect some rockiness as servers are constantly being replaced to keep up with the crashes. I’ll also ask to get the status page updated for last night and for the continued degraded performance until we figure out what is causing these crashes.
