The first graph shows a drop in CPU usage because garbage collection now runs less frequently, which means the nodes spend less time in garbage collection. A major garbage collection event is a “stop the world” scenario: all processing other than garbage collection stops for its duration. Since the number of classes that need to be loaded has decreased dramatically, the GC doesn’t need to run as often or for as long. The second graph shows that our memory usage dropped to about half when loading these classes. Previously we were pretty much at the limit of classes that we could load, so garbage collection would occur very frequently. Now that the decreased memory usage leaves a lot more headroom, garbage collection does not occur nearly as often.
So for example, say it’s 5:00 PM and a lot of people have executions scheduled as they come home. Prior to the change, every installation of CoRE or Nest Manager would load its own class into the JVM. This resulted in a lot of duplicated classes being loaded and then unloaded again, because a garbage collection had to be performed to make room for the next person’s execution. As the number of users/schedules increases, we end up spending more and more time loading/unloading classes to make room for them in the JVM. With this change, the vast majority of users are running the latest code for these SmartApps, so they all share the same class and no (or a lot less) time has to be spent loading/unloading these SmartApps.
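To make the 5:00 PM example concrete, here is a rough simulation of the before/after behaviour. It is only a sketch under stated assumptions: the bounded LRU cache, the capacity of 20, the SHA-256 fingerprint of the source, and all names (`simulate`, `per_installation_key`, `shared_key`) are hypothetical stand-ins, not how the platform actually caches or keys classes.

```python
from collections import OrderedDict
import hashlib


def simulate(executions, capacity, key_fn):
    """Run executions through a bounded LRU 'class cache'.

    Returns (loads, evictions). The LRU policy and the capacity are
    assumptions made for illustration only.
    """
    cache = OrderedDict()
    loads = evictions = 0
    for installation_id, source in executions:
        key = key_fn(installation_id, source)
        if key in cache:
            cache.move_to_end(key)  # class already loaded, just reuse it
            continue
        loads += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # unload the least recently used class
            evictions += 1
        cache[key] = True
    return loads, evictions


def per_installation_key(installation_id, source):
    # "Before": every installation gets its own class, even for identical code.
    return installation_id


def shared_key(installation_id, source):
    # "After" (assumed mechanism): identical source maps to one shared class.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


LATEST = "def installed() { /* latest release */ }"
OLDER = "def installed() { /* older release */ }"

# 100 hypothetical installations, 90 of them on the latest code,
# each executing twice in the same order.
installs = [(f"install-{i}", LATEST if i < 90 else OLDER) for i in range(100)]
executions = installs * 2

before = simulate(executions, capacity=20, key_fn=per_installation_key)
after = simulate(executions, capacity=20, key_fn=shared_key)
print("before (loads, evictions):", before)  # (200, 180)
print("after  (loads, evictions):", after)   # (2, 0)
```

With per-installation classes every execution misses the cache and forces an unload once it is full; with shared classes only the two distinct versions are ever loaded.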
We have not automatically updated anyone’s SmartApps. In the worst case scenario (everyone running a different version of a SmartApp) we would not have seen any reduction in CPU usage/GC times. We saw a drastic reduction because most people are already using the same version of the SmartApp (the latest one), so all of those people share the same Class in the JVM when their SmartApps execute. I do recommend updating to the latest code when you can, though, as it makes it much more likely that the SmartApp being executed is already loaded into the JVM at that point. For small apps it’s not a big deal, as it doesn’t take long to load the Class into the JVM, but as the apps get larger it takes significantly more time to parse/load the class (and then unload it afterwards).
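The worst case versus the typical case can be put into numbers with a quick back-of-the-envelope calculation. This is an illustrative sketch only: `load_reduction` is a hypothetical helper, the version labels and user counts are made up, and it assumes one class per distinct version after the change versus one class per installation before it.

```python
def load_reduction(installed_versions):
    """Fraction of class loads avoided when identical versions share one class.

    Assumes (for illustration) one class per installation before the change
    and one class per distinct version after it.
    """
    total = len(installed_versions)
    distinct = len(set(installed_versions))
    return 1 - distinct / total


# Worst case: 1000 users, every one on a different version.
worst_case = [f"custom-{i}" for i in range(1000)]

# Typical case (made-up split): 950 users on the latest code, 50 on one-off versions.
typical = ["latest"] * 950 + [f"old-{i}" for i in range(50)]

print(load_reduction(worst_case))  # 0.0, nothing is shared
print(load_reduction(typical))     # roughly 0.949
```

The more users sit on the same version, the closer the reduction gets to 1, which is why staying on the latest code helps.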
Canary nodes are generally used to verify JVM tuning parameters before rolling them out to all of the servers. Changes are tested in lower environments first, but when we can, it is safer to update a couple of nodes before rolling out the changes to every node in the cluster. In this case the canaries were running with the same settings as the rest of the cluster, so it doesn’t mean much; I just wanted to call it out because there was a big red line in the graph. (I should’ve been clearer on that in my first post.)