Unfortunately I don't have much of an update, but I figured I'd chime in again since it's been a while since the last post.
We are still performing rolling restarts on the scheduler cluster as bad nodes trigger alerts. Schedules have more or less stabilized (at the expense of our ops team's weekend). There's a very small chance of a scheduled execution being missed when a restart is performed, but since the restarts are staggered the impact should be minimal.
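For anyone wondering what "staggered" means in practice, here's a rough sketch of the restart loop; the node names and helper methods are placeholders standing in for our actual tooling, not real code from the platform:

```java
import java.util.List;

// Illustrative only: restart scheduler nodes one at a time, waiting for
// each node to report healthy before moving on, so at most one node's
// schedules are at risk at any given moment.
public class RollingRestart {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical node names for the sketch.
        List<String> nodes = List.of("sched-01", "sched-02", "sched-03");
        for (String node : nodes) {
            drain(node);    // stop routing new schedule executions to this node
            restart(node);  // restart the scheduler process
            while (!isHealthy(node)) {
                Thread.sleep(5_000); // wait before re-checking health
            }
        }
    }

    // Placeholder helpers: in reality these steps go through whatever
    // orchestration/automation actually manages the cluster.
    static void drain(String node) { System.out.println("draining " + node); }
    static void restart(String node) { System.out.println("restarting " + node); }
    static boolean isHealthy(String node) { return true; }
}
```

The point is that only one node is out of rotation at a time, so at worst a schedule that would have landed on that node during its restart window is at risk.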
On the engineering side, we've spun up canary nodes with updated JVM settings to see if we can stabilize the servers without having to perform restarts. Even if that works, it's still just a short-term fix until we understand what is filling memory on these servers and how.
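For the curious, the first step in answering that question is simply watching which memory pool is actually growing over time. A minimal sketch of that kind of instrumentation using the standard JVM management API (illustrative only, not our production code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Periodically log per-pool memory usage so we can see which pool
// (old gen, metaspace, code cache, etc.) is the one that keeps filling.
public class MemoryPoolLogger {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage usage = pool.getUsage();
                System.out.printf("%-25s used=%,d max=%,d%n",
                        pool.getName(), usage.getUsed(), usage.getMax());
            }
            Thread.sleep(60_000); // sample once a minute
        }
    }
}
```

If every heap pool looks flat while the overall process footprint keeps climbing, the growth is off-heap, which changes where we look next.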
Now to answer your question: servers crashing shouldn't cause an outage, because redundancy is built into the platform. While we're at our desks or at home actively working on the problem, we can stay ahead of crashes and the impact on users is minimal, since there is a cluster of servers available to handle the requests; that's why we didn't think it warranted a status page update. What happened overnight and into this morning was an outage because the monitoring we had put in place wasn't effective and alerts weren't sent out while people were asleep and not yet in the office.
To be completely honest, if this were January there probably wouldn't have been a status update in the first place, and if there had been, it would have been cleared by the time we got rolling restarts in place. Recently we have tried to be more transparent about issues with the platform, which is why the status has not yet been cleared: yes, schedules are more or less stable right now, but the platform is not yet healthy.
edit: Don't mean to get too corny here, but I wanted to say thanks to the community devs in Slack who chimed in today to provide data points, insight into memory issues, and support. You know who you are.