SmartApp executions have improved as we have implemented short-term mitigations. We continue to actively monitor performance while investigating the root cause of the degradation.
Sep 30, 20:27 EDT
Unfortunately I don't have much in the way of an update, but I figured I'd chime in again since it's been a while since the last post.
We are still performing rolling restarts on the scheduler cluster as bad nodes trigger alerts. Schedules have more or less stabilized (at the expense of our ops team's weekend). There's a very small chance of a scheduled execution being missed when a restart is performed, but since the restarts are staggered, the impact should be minimal.
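For anyone curious what a staggered rolling restart looks like, here is a minimal sketch. The node names, the stagger interval, and the restart command itself are illustrative assumptions, not SmartThings' actual tooling - the point is simply that only one node is down at a time, so the rest of the cluster keeps serving.

```shell
# Sketch of a staggered rolling restart across a scheduler cluster.
# Node names, the restart command, and the interval are assumptions
# for illustration only.

rolling_restart() {
  nodes="$1"        # whitespace-separated node list
  stagger="$2"      # seconds to wait before moving to the next node
  for node in $nodes; do
    echo "restarting $node"
    # ssh "$node" 'sudo systemctl restart scheduler'  # real restart would go here
    sleep "$stagger"
  done
}

rolling_restart "sched-01 sched-02 sched-03" 1
```

Because the restarts are serialized with a delay between them, at most one node is unavailable at any moment, which is why only a small window exists for a scheduled execution to be missed.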
On the engineering side, we've spun up canary nodes with updated JVM settings to see if we can stabilize the servers without having to perform restarts, but stabilization is still just a short-term fix until we understand what is filling the servers' memory in this way, and how.
Now to answer your question… servers crashing shouldn't cause an outage, because redundancy is built into the platform. While we're at our desks or at home actively working on the problem, we can stay ahead of crashes, and the impact on users is minimal since a cluster of servers is available to handle the requests - which is why we didn't think it warranted a status page update. What happened overnight and into this morning was an outage because the monitoring that was put in place wasn't effective: alerts weren't sent out while people were asleep and not yet in the office.
To be completely honest, if this were January there probably wouldn't have been a status update in the first place, and if there had been, it would have been cleared by the time we got rolling restarts in place. Recently we have tried to be more transparent about issues with the platform, which is why the status has not yet been cleared - yes, schedules are more or less stable right now, but the platform is not yet healthy.
edit: Don't mean to get too corny here, but I wanted to say thanks to the community devs in Slack who chimed in today to provide data points, insight into memory issues, and support. You know who you are.
Hey now, don't blame this on ticker! Ticker is pretty amazing, and as far as I know there hasn't been a problem with it since I joined (early April). The issue is with the cluster that executes the SmartApps… internally we call it the scheduler cluster because it executes the jobs that ticker tells it to execute. That is what is crashing - ticker is chugging along fine.
I've had a hunch it wasn't ticker, specifically.
To wit: this afternoon I finished integrating our smoke alarms into ST (a modified contact sensor working from a Kidde SM120X Relay/Power Supply Module). During the course of testing, I set off the smoke alarms maybe a dozen times and observed the "smoke" device toggle between "detected" and "clear." Cool.
Next was to set up SHM for this site (previously I used it only for watching door sensors on our second home). This instance was set up to 1) send a notification to me and 2) turn on some lights.
Mostly, it worked. But one time, only two of the three specified lights came on.
Another update:
We have made some changes to the JVM GC that help limit off-heap memory and reduce GC contention under certain types of load. We have now been running scheduler nodes without a similar crash for about 24 hours (they were being restarted every 2 hours when I last posted an update). We should be updating our status page back down to monitoring sometime soon, and if we continue to run smoothly for a while longer we'll drop the status to operational again.
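For readers wondering what "changes to the JVM GC" might look like in practice, here is a hedged sketch of the kind of options that cap off-heap (metaspace and direct-buffer) memory and reduce GC contention. The specific flags and values below are illustrative assumptions, not the settings SmartThings actually deployed.

```shell
# Hypothetical JVM launch options for a scheduler node; all values are
# illustrative assumptions, not SmartThings' production settings.
JAVA_OPTS="\
  -Xms4g -Xmx4g \
  -XX:MaxMetaspaceSize=512m \
  -XX:MaxDirectMemorySize=1g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200"

# Actual launch would look something like this (jar name is a placeholder):
# java $JAVA_OPTS -jar scheduler.jar
echo "$JAVA_OPTS"
```

Capping `MaxMetaspaceSize` and `MaxDirectMemorySize` bounds the off-heap regions that the default settings leave effectively unlimited, which matters when a process is being killed by memory growth outside the Java heap; a pause-target collector like G1 can also reduce contention under bursty load.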
From left to right: the first two rows are Sharp Tools (Android), the third is Hue widgets, and the fourth is Do buttons by IFTTT. If you are on Android, look up Sharp Tools on Google Play… it's a community SmartApp developed by @joshua_lyon…