Failures again

Every system has pluses and minuses. See the following:

Looking into this now. If confirmed, I’ll wake people up.

Hmm yea, noticed it too. My motion lights didn’t turn off after 2 mins. Thought it was me from having added a new motion sensor and automation. They worked just now tho. Was about an hour ago as well.

I had some partial failures, looks like timeouts, on some of my bigger Pistons tonight as well.

And I guess I could be more clear. The only failures I noticed were lights not turning off after 2 mins of no motion. And that is all using the stock SmartLighting app, with ST, Iris, and Aeon motion sensors. But that was all I noticed.

Looks like we had a small spike around the time of this post, but everything stabilized minutes later. I’ll keep an eye on it the rest of the night.

I’ll have an engineer look into the spike tomorrow if everything stays normal tonight.

Well, I can confirm this…

  • 3rd day in a row that my system says a mode has changed… but the mode did not actually change
  • Lights turn on, but do not turn off
  • Z-wave devices having 5+ second delays
  • Lots of automations failing due to timeouts

You should most definitely start waking people up… especially since this has all been talked about for over a week.

Have been having similar issues for the last few weeks. Devices do not respond immediately and some haven’t worked at all since mid July. Routines are not running properly; the mode generally changes but lights will not turn on/off. Siren tied to the alarm has been triggered twice this month in the middle of the night with no reason for it to be activated.

I have been using various smarthome hubs for the last few years and have 60+ devices, but this is my first time participating in the community. The recent issues I have been having started to make me think something was wrong with my setup. After briefly reading through recent posts it appears it is a much bigger problem. Thanks everyone for letting me know I’m not alone, as it was starting to drive me crazy. In the past I have expected that the SmartThings status page would identify that a problem exists. Oftentimes nothing would be noted on the official site but my issues would be resolved fairly quickly and I moved on. With the length of time I have been having technical difficulties this time around, I am glad I dug a little deeper and found this forum. Apparently the Status Page is not reliable; the community seems to be a much better indicator.

Yup, lights haven’t worked around here in a few days, at least. Simple timed lights using the “Lighting Automation” app.

Here are some logs. The strange thing is that it seems like the event fired to turn them on last night (and the previous night) but they didn’t turn on. Also, why are the “stop” events firing twice? (I will admit that I haven’t looked at these logs before, so I don’t know what ‘normal’ looks like.)

SmartLighting failure here, too. But it’s been doing it for over a week. Support was notified & told me it was a known issue & engineering was looking into it…

Mode didn’t change with routine set for sunrise here in the central timezone. Was wondering why my motion lights weren’t coming on at all. House still stuck in Night mode.

I think they must have given you the slot I had and I took over yours.

I haven’t had any of the issues you have been reporting here over the last few weeks (that I know of).

But for months I was experiencing debilitating failures that were never acknowledged by support or ST at large (reboot, we’ve refreshed your apps, report back if it happens again, rebuild your rule, edge case, etc). It was awful enough at one point, I did not touch ST for more than 4 months.

In any event, I think it highlights that no amount of methodology or workarounds (on the user side) can ultimately be depended upon and, once again, dynamics unknown to us and out of our control can hit at any time. I think we get into a mode where we think we’ve bested the system, but the system has the ultimate control.

Yup. As an IT guy that has had to ride the “cloud” wave and has seen how awful it can be to give up control of major systems to some vendor, well… this is a great example of why it maybe isn’t the best idea in every case to do that.

There were a lot of timeout exceptions occurring all weekend. It seems much worse this morning. (Are community apps being throttled?)

We saw an increase in scheduling activity. We are expanding the cluster in the short term (which should release some of the pressure) until we can find the root cause, if there is one; it could just be a real increase in activity.

Wanted to take a quick break to give an update from the engineering side of things as we don’t always do the best job of keeping @slagle in the loop at all times.

There are a few different issues in play on the scheduling side of things right now:

  • Timeouts

This is the issue most people are seeing - in July we pushed out a fix for a problem that was causing our database connections to spike and lock up boxes on the scheduling cluster. We were expecting this fix to have a bigger effect on scheduling timeouts than it did. We are continuing to look and there are a few things that are standing out right now. The following is a graph of CPU load/averages across boxes in the scheduling cluster over the past couple of months in NA00/NA01 (na00/na01 are names for the same shard):

Generally you want your load average to be at a number that is less than the number of cores on the box - if it’s higher than that you will begin to see backups. This is a helpful metric in addition to average CPU usage. These execution boxes generally run at 8 cores and we like to keep the load average fairly low in case of a spike in traffic. A rising load average can be caused by organic growth of the platform (increase in number of scheduled app executions), a smart app (community or ST) incorrectly scheduling, or a variety of other issues. The rise in these graphs seems to line up with when we started to notice an uptick in support tickets. We have deployed additional scheduler boxes to bring the load average back down to pre mid July numbers. Here is what the graph looks like after the deployment:

A running theory is that during high load times a high(er) CPU average may have downstream effects on the box: it can peg the CPU in certain instances where it wouldn’t have if CPU utilization were lower. We suspect this because a load average of 4 shouldn’t have been causing issues in the first place. Adding scheduler boxes will likely help with timeouts for the time being but doesn’t address the root cause.
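To make the load-average rule of thumb above concrete, here is a minimal Python sketch. It is only an illustration, not ST's monitoring code; the function names `load_headroom` and `is_backing_up` are made up for this example, and the 8-core / load-of-4 numbers are the ones quoted in the post.

```python
import os


def load_headroom(load_avg: float, cores: int) -> float:
    """Fraction of core capacity still free; negative means work is queuing."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return 1.0 - load_avg / cores


def is_backing_up(load_avg: float, cores: int) -> bool:
    """Rule of thumb: expect backups once load average exceeds core count."""
    return load_avg > cores


# Numbers from the post: 8-core execution boxes, load average around 4.
print(load_headroom(4.0, 8))   # half of the capacity is still free
print(is_backing_up(4.0, 8))   # not backing up by this rule

# On Unix-like systems the same check can be run against the local machine.
if hasattr(os, "getloadavg"):
    try:
        one_minute, _, _ = os.getloadavg()
        print(is_backing_up(one_minute, os.cpu_count() or 1))
    except OSError:
        pass  # load average not obtainable on this system
```

By this rule a load average of 4 on an 8-core box still has plenty of headroom, which is why the post treats it as surprising that such a load was associated with timeouts.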

One of our main focuses right now is looking at the ClassLoader of the JVM on the scheduler cluster - a thing to note is that the cluster we use to execute SmartApps off of event subscriptions is running the same platform code and executing more SmartApps than the scheduler cluster, but it is not seeing the same ClassLoader issues that the scheduler cluster is. The scheduler cluster is in a state where major GCs are occurring frequently. When the JVM goes into a major GC it stops all application threads until the GC is complete - causing our scheduled SmartApp executions to stall and eventually time out. We have a few tools at our disposal to help us diagnose this but it’s not something that can be quickly pinned down to a root cause, and it involves digging into the JVM in Production in a way that has minimal impact on users. Running theory here is that there are usage patterns in the scheduler cluster that we haven’t accounted for yet that require additional tuning. I think this is where the most bang for the buck would be in terms of impact on SmartApp timeouts.
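The way stop-the-world pauses turn into timeouts can be shown with a toy model. The Python sketch below is an assumption-laden illustration rather than anything taken from the platform: `finish_time` and `timed_out` are hypothetical helpers, the pause windows are invented, and the 20-second limit is just a sample value. Each job needs a fixed amount of CPU work, and all work freezes during any pause window.

```python
def finish_time(start, work, pauses):
    """Wall-clock time at which a job finishes.

    start:  time the job begins (seconds)
    work:   seconds of actual execution the job needs
    pauses: sorted, non-overlapping (pause_start, pause_end) windows during
            which every application thread is frozen
    """
    now = start
    remaining = work
    for pause_start, pause_end in pauses:
        if pause_end <= now:
            continue  # this pause was over before the job got here
        if pause_start >= now + remaining:
            break  # the job completes before this pause begins
        # Run until the pause begins (zero if already inside it), then wait.
        remaining -= max(0.0, pause_start - now)
        now = pause_end
    return now + remaining


def timed_out(start, work, pauses, timeout):
    """True if wall-clock duration, including frozen time, exceeds timeout."""
    return finish_time(start, work, pauses) - start > timeout


# A job needing 5 s of work, with and without a long pause (made-up values).
print(finish_time(0.0, 5.0, []))                    # 5.0
print(finish_time(0.0, 5.0, [(2.0, 20.0)]))         # 23.0
print(timed_out(0.0, 5.0, [(2.0, 20.0)], 20.0))     # True
```

The job itself does the same 5 seconds of work in both cases; only the frozen time pushes it past the limit, which matches the description of executions stalling and eventually timing out.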

  • SmartApps not executing at all

This is the latest in the series of scheduler failures that we have come across. You can identify if a SmartApp is being affected (our support team does this before forwarding the tickets to engineering) by looking at the scheduled event history for your SmartApp via the Location screen. You will see a “Scheduled Execution” for the correct time in that screen, but you will notice that there are no executions at the specified times. E.g., a SmartApp is scheduled to run at 10:00PM every day but it hasn’t executed for N days:

What we see happening is that ticker logs that a job was sent to the scheduler cluster and no further processing is completed. I have (personally) verified that ticker is putting messages on the job queue by polling queues for affected SmartApps and seeing the messages go across the wire. We are currently trying to figure out where we can add additional logging to identify the exact spot of the failure to execute, but I do want to mention that we have a ton of logging in this area already, and identifying a path that the code could take without SOME message being produced is where the difficulty is. This could be somehow related to the Garbage Collection, but I’d rather not make any assumptions, and we are keeping a work stream open in this area as well as diagnosing these GC issues.
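The "scheduled but never executed" symptom described above amounts to comparing two lists of timestamps. As a rough Python sketch, assuming you have copied the scheduled times and the actual execution times out of the event history by hand (`missed_runs`, the 5-minute tolerance, and the sample dates are all made up for illustration, not part of any ST API):

```python
from datetime import datetime, timedelta


def missed_runs(scheduled, executed, tolerance=timedelta(minutes=5)):
    """Return scheduled times with no execution within `tolerance` afterwards."""
    missed = []
    for due in scheduled:
        ran = any(due <= ts <= due + tolerance for ts in executed)
        if not ran:
            missed.append(due)
    return missed


# Sample data: an app scheduled for 10:00PM on three consecutive days
# that only actually ran on the first day.
scheduled = [datetime(2016, 8, day, 22, 0) for day in (1, 2, 3)]
executed = [datetime(2016, 8, 1, 22, 0, 2)]

for due in missed_runs(scheduled, executed):
    print("no execution found for", due.isoformat())
```

With this sample data the check flags the second and third days, which is the pattern of a correct “Scheduled Execution” entry with no corresponding run.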

  • SmartApps executing at the wrong time

This is another scheduling error that we have seen occur lately but is limited to SmartApps that were executing locally but were then updated to execute in the cloud (either manually or because an incompatible Device Type was added). A bug was causing the update to not unschedule the app from the hub. We have identified the root cause for this issue and a fix has been put in place - currently in lower environments waiting on QA before a PROD deploy.

I think these cover the cases that the majority of the requests we are seeing are related to (@Aaron let me know if I missed one and I’ll see if I can provide more detail)

Also wanted to mention that support isn’t really kept in the loop for these issues in this much detail. The relationship is much more support feeding info to engineering than the other way around. What you guys don’t see is that every support ticket that goes in with regards to these (and most other) failures ends up with a very verbose description of the problem that makes it easier for engineering to diagnose these issues. The majority of the time the issues that appear aren’t anywhere near as clear cut as these are, and I’m constantly impressed with the feedback that they provide.

Holy smokes. Now that’s a meaningful response.

I wish this wasn’t necessary, but boy this is PERFECT! Thank you for taking the time. I know you guys are super busy and don’t have time to spend on writing detailed messages, but this is the kind of transparency that many asked for. Sharing your struggles and pain points with us gives us the fuel to be more understanding when things fail and more optimistic that things will get better. Thank you!

This is probably the most non-ST-staff response I have seen in the 10 months I have been here. To me it is a level of transparency with the community that has been lacking. It also explains why support’s replies are often less than helpful.

Thanks much, @vlad for the comprehensive discussion of the problems you guys are facing. This is fantastic! And it gives us propeller heads something to chew on in the meantime, LOL :smiley:
Appreciate the effort you and @slagle went to today to provide this information. Also thanks to @Aaron for reaching out to me on my specific issue reported earlier to support.
Nice job, guys!