On Friday October 14 we pushed a platform update that changed how caching for SmartApps works. Prior to the update, every installation of a community SmartApp was cached independently. This type of caching worked fine at first but did not scale well once more users began to adopt large community SmartApps. The biggest impact was during scheduled executions, because we would load and evict many SmartApps very quickly, which caused garbage collection pauses (I have written about this many times this year). With the new way of caching, we take a checksum of the code that is being executed and cache the SmartApp using the checksum as the key. This significantly reduces the number of SmartApps that need to be stored in our cache and results in much more scalable, performant and efficient executions.
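For readers who want to see the idea in code, here is a minimal sketch of a checksum-keyed cache as described above. It is not the platform's implementation: the class name `SmartAppCache`, the use of SHA-256, and Python's built-in `compile` as a stand-in for building a class are all assumptions made for illustration.

```python
import hashlib


class SmartAppCache:
    """Toy cache keyed by a checksum of the SmartApp source (hypothetical)."""

    def __init__(self):
        self._by_checksum = {}
        self.compile_count = 0

    @staticmethod
    def checksum(source: str) -> str:
        # Assumption: SHA-256. The post only says "a checksum".
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def _compile(self, source: str):
        # Stand-in for parsing and loading the code as a class.
        self.compile_count += 1
        return compile(source, "<smartapp>", "exec")

    def get(self, source: str):
        # Identical source gives an identical key, so the entry is shared.
        key = self.checksum(source)
        if key not in self._by_checksum:
            self._by_checksum[key] = self._compile(source)
        return self._by_checksum[key]


cache = SmartAppCache()
code_v1 = "def run():\n    return 'v1'\n"
code_v2 = "def run():\n    return 'v2'\n"

app_a = cache.get(code_v2)  # user A: compiled and cached
app_b = cache.get(code_v2)  # user B: same checksum, reuses A's entry
app_c = cache.get(code_v1)  # user C: older version, separate entry
```

Users A and B end up with the same cached object, while user C's older version gets its own entry, so only two compilations happen for three installations.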
Here are a couple of interesting figures prior to and after the update:
Average CPU Usage across the scheduler cluster:
While just applying the update itself had a significant impact on the performance of the scheduler, we can see even greater gains by having everyone use the latest version of a community SmartApp. So, as if you needed any more reasons to update to the latest version of the community apps you’re using… faster and more reliable execution from the platform is now another one.
As long as you don’t reverse that mistake, you’ll be fine.
(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy)
While using a checksum to identify identical code and making the suggestion above are both excellent, I hope this is only a “stop-gap” measure on the way to the real solution: allow full self-publishing and distribution of SmartApps to the Community.
Oh wait… we already had that ability and it was revoked. SMH.
The first graph shows a drop in CPU usage because garbage collection runs less frequently, which translates into less time spent by the nodes in garbage collection. When a major garbage collection event occurs it is a “stop the world” scenario where all processing apart from garbage collection stops for that time. Since the number of classes that need to be loaded has decreased dramatically, the GC doesn’t need to run as often or for as long. The second graph shows that our memory usage dropped to about half when loading these classes. Previously we were pretty much at the limit of classes that we could load, so garbage collection would occur very frequently. Now that there is a lot more headroom with the decreased memory usage, garbage collection does not occur nearly as often.
So for example, say it’s 5:00 PM and a lot of people have executions scheduled as they come home. Prior to the change, every installation of CoRE or Nest Manager would load its own class into the JVM. This would result in a lot of duplicated classes being loaded and then having to be unloaded, because a garbage collection needs to be performed to make room for the next person’s execution. As the number of users/schedules increases, we end up spending more and more time loading/unloading classes to make room for them in the JVM. With this change, the vast majority of users are using the latest code for these Apps, so they all share the same class and no (or a lot less) time has to be spent loading/unloading these SmartApps.
We have not automatically updated anyone’s SmartApps. In the worst case scenario (everyone having different versions of a SmartApp) we would not have seen any reduction in CPU usage/GC times. The reason we saw a drastic reduction is that most people are already using the same version of the SmartApp (the latest one), so the GC times dropped as all of those people share the same Class in the JVM when they execute. I do recommend updating to the latest code when you can though, as it is much more likely that the SmartApp being executed is already loaded into the JVM at that point. For small apps it’s not a big deal, as it doesn’t take long to load the Class into the JVM, but as the apps get larger it takes significantly more time to parse/load the class (and then unload it after).
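To make the difference between the typical case and the worst case concrete, here is a rough simulation of the scheduled-execution scenario described above. It assumes a small LRU cache with a fixed capacity as a stand-in for the JVM's class memory limit; the real eviction happens through garbage collection, not LRU, and the function name `count_loads`, the capacity, and the user counts are all made up for illustration.

```python
from collections import OrderedDict


def count_loads(executions, capacity, key_fn):
    """Replay executions through an LRU cache and count class loads."""
    cache = OrderedDict()
    loads = 0
    for user, checksum in executions:
        key = key_fn(user, checksum)
        if key in cache:
            cache.move_to_end(key)  # already loaded, nothing to do
            continue
        loads += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used entry
        cache[key] = True
    return loads


def per_install(user, checksum):
    # Old scheme: one cache entry per installation.
    return (user, checksum)


def by_checksum(user, checksum):
    # New scheme: one cache entry per distinct piece of code.
    return checksum


# Hypothetical workload: 1000 users, two rounds of scheduled executions.
users = range(1000)
same_version = [(u, "latest") for u in users] * 2
all_different = [(u, f"version-{u}") for u in users] * 2

capacity = 100  # assumed limit on loaded classes
old_scheme = count_loads(same_version, capacity, per_install)
new_scheme = count_loads(same_version, capacity, by_checksum)
worst_case = count_loads(all_different, capacity, by_checksum)
```

With everyone on the same version, the old per-installation keying loads a class on every one of the 2000 executions, while checksum keying loads it once. If every user runs different code, checksum keying is back to 2000 loads, which is why the update only helps to the extent that people share a version.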
Canary nodes are generally used to verify JVM tuning parameters before rolling them out to all of the servers. Changes are tested in lower environments first but it is safer to update a couple of nodes when we can before rolling out the changes to all nodes in the cluster. In this case the canaries were running under the same settings as the rest of the cluster so it doesn’t mean much but I just wanted to call it out as there was a big red line in the graph. (Should’ve been clearer on that in my first post).
I always enjoy hearing about these kinds of updates and efficiencies, and it shows we are moving to a much more scalable platform. The ironic thing about this is I would say the system has been a wreck for about a week now, but hopefully more changes like this keep the big machine running smoothly. Keep up the good work guys.
Perhaps I am missing something here. Performance in the iOS mobile app has been awful this week. Is this just laying the groundwork for improvements and that is why we haven’t seen any improvements this week?
CoRE and Nest Manager are very slow. Even going into my “Smartapps” tab is extremely slow. I know you said they need to be updated, but not sure if those are being planned?
The issues you are describing are unrelated to this update; they are being caused by latency problems with one of the databases. While these changes provide some overall performance improvements, unfortunately they don’t help with database timeouts.
So for those of us that are a bit thick, old style was basically a JVM gets spooled up for each instance of a SmartApp PER USER. Now only one gets spooled up for each version, but more than one user can use it at a time? So GC only occurs when no users are actively “touching” that JVM, vs before each user’s app JVM would be killed at the end of that user’s use of it?
Not quite - there is only one JVM in play per Scheduler node. The actual executions are sandboxed within that JVM. That hasn’t changed. Think of it like this:
User A & B both have a community SmartApp. The code for the SmartApp is exactly the same, meaning that the code gets compiled into the same Class. Previously we would create 2 Classes (one for each user), each taking up its own memory. Now when user B executes (given that user A already did) and we look up the SmartApp in the cache, we see that there is already a SmartApp with the same checksum in the cache (aka the code is the same), so instead of creating a new Class for user B we just reuse the one that was created for user A. GC comes into the picture when we multiply this situation by thousands or tens of thousands of users and SmartApps and we are at a memory limit. Previously, if User B tried to execute and there wasn’t enough room in memory to create his SmartApp, User A’s SmartApp would have to be garbage collected first (major GC, a stop-the-world collection because of how Class unloading works) to make room for User B’s SmartApp. Now, since we’ve de-duplicated a lot of SmartApps, we’re not at the limit anymore, and even if we were, chances are that the SmartApp is already in memory and we wouldn’t have to perform a GC anyway.