Yea - one of the issues was "fixed" by replacing 3 nodes in our events cluster for na01. These are read timeouts from that cluster. Reducing those should have helped with the UI, IDE, smartapp executions, etc.
The other issue that we're seeing is timeouts for saving events:
It doesn't look like the replacement fixed the issue here, as the timeouts are still elevated and follow a cyclical pattern. (Can see other Cassandra metrics trending upwards again.) This mainly affects execution, and the chance of it occurring increases in proportion to the rate of created events per app execution. So while many smartapps/devices are firing fine, certain ones are hit more often (the ones that create more events). We have a change that can go out to make event creation async, which would help with executions, but that could have a number of unintended side effects and would change the behavior of a core part of the system. It would be much safer to figure out what happened in the last couple of days to cause these spikes.
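For context, a minimal sketch of what "making event creation async" would look like, assuming a blocking event store call today; `EventStore`, `Event`, and `AsyncEventWriter` are stand-in names, not the actual classes in our codebase:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch only: decoupling event persistence from the execution path.
public class AsyncEventWriter {

    interface EventStore {
        void save(Event event); // blocking Cassandra write today
    }

    record Event(String appId, String payload) {}

    private final EventStore store;
    private final ExecutorService writer = Executors.newFixedThreadPool(4);

    public AsyncEventWriter(EventStore store) {
        this.store = store;
    }

    // Current behavior: the app execution blocks until the write (and any
    // timeout/retry) completes, so event-heavy apps feel the spikes the most.
    public void saveSync(Event event) {
        store.save(event);
    }

    // Proposed behavior: hand the write off to a background pool and return
    // immediately. The execution no longer waits on Cassandra, but we lose the
    // guarantee that the event is durable before the execution continues --
    // one of the side effects that would need to be weighed.
    public void saveAsync(Event event) {
        writer.submit(() -> store.save(event));
    }
}
```

This is why the change helps executions but shifts risk elsewhere: failures and backpressure on the write path become invisible to the executing app unless we add monitoring or bounded queues around that pool.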