Wanted to take a quick break and give an update from the engineering side of things, since we don’t always do the best job of keeping @slagle in the loop.
There are a few different issues in play on the scheduling side of things right now:
- SmartApps timing out

This is the issue most people are seeing. In July we pushed out a fix for an issue that was causing our database connections to spike and lock up boxes on the scheduling cluster. We expected this fix to have a bigger effect on scheduling timeouts than it did. We are continuing to look, and a few things stand out right now. The following is a graph of CPU load averages across boxes in the scheduling cluster over the past couple of months in NA00/NA01 (na00/na01 are names for the same shard):
Generally you want your load average to be less than the number of cores on the box - if it’s higher than that you will begin to see backups. This is a helpful metric in addition to average CPU usage. These execution boxes generally run with 8 cores, and we like to keep the load average fairly low in case of a spike in traffic. An elevated load average can be caused by organic growth of the platform (an increase in the number of scheduled app executions), a SmartApp (community or ST) scheduling incorrectly, or a variety of other issues. These graphs seem to line up with when we started to notice an uptick in support tickets. We have deployed additional scheduler boxes to bring the load average back down to pre-mid-July numbers. Here is what the graph looks like after the deployment:
A running theory is that during high-load times a high(er) CPU average may have downstream effects on the box, because it will peg the CPU in certain instances where it otherwise wouldn’t have if utilization was lower - a load average of 4 shouldn’t have been causing issues in the first place. This will likely help with timeouts for the time being, but it doesn’t address the root cause.
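For anyone who wants to see the load-average rule of thumb in code, here is a minimal Groovy sketch using the JDK’s standard management beans - the thresholds and output are purely illustrative, not our actual monitoring:

```groovy
import java.lang.management.ManagementFactory

// 1-minute system load average, the same number uptime/top report
def os = ManagementFactory.operatingSystemMXBean
double loadAvg = os.systemLoadAverage        // -1.0 if the platform can't report it
int cores = Runtime.runtime.availableProcessors()

if (loadAvg >= 0) {
    // Rule of thumb from above: sustained load above the core count means work is queuing up
    println "load average ${loadAvg} across ${cores} cores -> " +
            (loadAvg < cores ? "healthy" : "backing up")
}
```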
One of our main focuses right now is the ClassLoader behavior of the JVM on the scheduler cluster. A thing to note is that the cluster we use to execute SmartApps off of event subscriptions runs the same platform code and executes more SmartApps than the scheduler cluster, yet it is not seeing the same ClassLoader issues. The scheduler cluster is in a state where major GC collections are occurring frequently. When the JVM goes into a major GC it stops all application threads until the collection is complete, causing our scheduled SmartApp executions to stall and eventually time out. We have a few tools at our disposal to help us diagnose this, but it’s not something that can be quickly pinned down to a root cause - it involves digging into the JVM in production in a way that has minimal impact on users. The running theory here is that there are usage patterns on the scheduler cluster that we haven’t accounted for yet and that require additional tuning. I think this is where the most bang for the buck would be in terms of impact on SmartApp timeouts.
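For those curious what “watching GC” looks like in practice, here is an illustrative Groovy sketch using the standard JMX garbage collector beans (not the actual tooling we run in production). A collection count that climbs quickly on the old-generation collector is the kind of frequent-major-GC pattern described above:

```groovy
import java.lang.management.ManagementFactory

// Each bean corresponds to one collector, typically a young-gen
// collector and an old-gen (major/full) collector.
ManagementFactory.garbageCollectorMXBeans.each { gc ->
    println "${gc.name}: ${gc.collectionCount} collections, " +
            "${gc.collectionTime} ms accumulated collection time"
}
```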
- SmartApps not executing at all
This is the latest in the series of scheduler failures that we have come across. You can identify whether a SmartApp is affected (our support team does this before forwarding tickets to engineering) by looking at the scheduled event history for your SmartApp via the Location screen. You will see a “Scheduled Execution” for the correct time in that screen, but there are no executions at the specified times. e.g.) A SmartApp is scheduled to run at 10:00 PM every day but it hasn’t executed for N days:
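For reference, a daily schedule like that is typically set up inside the SmartApp with the platform’s schedule() API - this runs in the SmartThings sandbox, and the handler name here is hypothetical:

```groovy
def initialize() {
    // Cron-style schedule: fire at 10:00 PM local time every day
    schedule("0 0 22 * * ?", nightlyCheck)
}

def nightlyCheck() {
    log.debug "nightlyCheck fired at ${new Date()}"
}
```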
What we see happening is that ticker logs that a job was sent to the scheduler cluster, and then no further processing is completed. I have (personally) verified that ticker is putting messages on the job queue by polling the queues for affected SmartApps and watching the messages go across the wire. We are currently trying to figure out where we can add additional logging to identify the exact spot of the failure to execute, but I do want to mention that we already have a ton of logging in this area - the difficulty is identifying a path the code could take without SOME message being produced. This could somehow be related to the garbage collection, but I’d rather not make any assumptions, so we are keeping a work stream open in this area as well as on diagnosing the GC issues.
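To make the “no code path without SOME message” idea concrete, here is a purely illustrative Groovy sketch - an in-memory queue standing in for our real job queue, with none of these names being our actual internals - showing a consumer that logs on every branch so a job can’t disappear silently:

```groovy
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit

def jobQueue = new LinkedBlockingQueue<String>()
jobQueue.put("smartapp-1234:runDaily")   // pretend ticker enqueued this

def job = jobQueue.poll(5, TimeUnit.SECONDS)
if (job == null) {
    println "WARN  no job received before timeout"            // branch 1: nothing arrived
} else {
    println "INFO  dequeued job ${job}"                        // branch 2: got a job
    try {
        // ... hand off to the execution pipeline here ...
        println "INFO  job ${job} handed off for execution"
    } catch (Exception e) {
        println "ERROR job ${job} failed before execution: ${e.message}"   // branch 3
    }
}
```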
- SmartApps executing at the wrong time
This is another scheduling error that we have seen occur lately, but it is limited to SmartApps that were executing locally and were then updated to execute in the cloud (either manually or because an incompatible Device Type was added). A bug was causing the update to not unschedule the app from the hub. We have identified the root cause for this issue and a fix is in place - it is currently in lower environments waiting on QA before a PROD deploy.
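The actual fix is on the platform side, but for context, unscheduling uses the same unschedule() call SmartApp developers already know - the common defensive pattern is to clear old schedules whenever an app’s settings change, so stale jobs can’t keep firing (handler name hypothetical):

```groovy
def updated() {
    // Clear any schedules left over from the previous configuration
    // before setting up new ones
    unschedule()
    initialize()
}

def initialize() {
    schedule("0 0 22 * * ?", nightlyCheck)
}

def nightlyCheck() {
    log.debug "nightlyCheck fired"
}
```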
I think these cover the cases behind the majority of the requests we are seeing (@Aaron let me know if I missed one and I’ll see if I can provide more detail).
Also wanted to mention that support isn’t really kept in the loop on these issues in this much detail - the relationship is much more support feeding info to engineering than the other way around. What you guys don’t see is that every support ticket that comes in regarding these (and most other) failures ends up with a very verbose description of the problem, which makes it easier for engineering to diagnose these issues. The majority of the time the issues that appear aren’t anywhere near as clear-cut as these, and I’m constantly impressed with the feedback that support provides.