Scheduled jobs failing (again) (again 😄) (Ongoing Known Issue)

Yeah, one of the issues was “fixed” by replacing 3 nodes in our events cluster for na01. That issue was read timeouts from that cluster, and reducing them should have helped with the UI, IDE, SmartApp executions, etc.


The other issue that we’re seeing is timeouts for saving events:

It doesn’t look like the node replacement fixed this one: the write timeouts are still elevated and follow a cyclical pattern (we can see other Cassandra metrics trending upwards again). This mainly affects execution, and the chance of hitting it goes up in proportion to the number of events an app execution creates. So while many SmartApps/devices are firing fine, certain ones get hit more often (the ones that create more events). We have a change that could go out to make event creation async, which would help with executions, but it could have a number of unintended side effects and would change the behavior of a core part of the system. It would be much safer to figure out what happened in the last couple of days to cause these spikes.
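
To put rough numbers on why event-heavy apps get hit more often, here is a minimal sketch. It assumes each event save times out independently with the same probability, which is a simplification and not something measured from the cluster; the function name and the 1% rate are made up for illustration.

```python
def chance_of_timeout(events_per_execution: int, per_write_timeout_rate: float) -> float:
    """Probability that at least one event save times out in one execution.

    Assumes (hypothetically) that every save fails independently with the
    same probability, so the chance that all saves succeed is (1 - p) ** n.
    """
    return 1.0 - (1.0 - per_write_timeout_rate) ** events_per_execution


if __name__ == "__main__":
    # 0.01 is an illustrative per-write timeout rate, not a measured value.
    for n in (1, 5, 20):
        print(f"{n} events per execution: {chance_of_timeout(n, 0.01):.4f}")
```

For small timeout rates this is close to events × rate, which matches the roughly proportional behavior described above.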
