Extreme Lag (Cause: RM, too many devices/SmartApps, etc.)

My system has been stable for a while. I have seen tremendous improvements.

Yesterday I had an internet outage. No big deal, most of my devices run locally. Unfortunately, nothing ran. Strange!

Everything was rebooted and all was well (or so I thought).

Today the lags are in the 2 to 5 minute range. No joke - minutes. I reached out to support. Here are the responses:

Thanks for writing in. Sorry for the trouble with your devices.

Most of the time when we see issues like this it’s due to SmartApps that are doing a lot of checking when commands are supposed to be sent. Taking a look from my end, I can see you have quite a few SmartApps on your system. Having lots of SmartApps and devices is totally fine, but there may be a few of them that are causing these issues.

Do you know around when you started seeing these delays? Were there any new SmartApps that you had installed around that time that need to listen for multiple changes or check lots of devices?

It may not be the cause, but since Rule Machine has been discontinued by its developer, it hasn’t been updated. Rule Machine in general has a lot of checks that it performs whenever actions are done and those sort of checks can cause commands to have to go through another layer of processing before they can be sent to devices. I would recommend transitioning your Rule Machine automations into Smart Lighting or something different and try removing Rule Machine from your setup to see if you notice any improvements. A delay of a few seconds is usually expected but 2-5 minutes is certainly well above what you should see, so something is slowing the devices down.

Taking a look from my end, Rule Machine was the main one that stood out to me that does enough checks/actions to cause things to slow down but there were a number of SmartApps that I am not familiar with. If you’re able to test with any of them (ones that can be temporarily removed and re-added the easiest) it may be a good idea to go through some of the larger ones to see if you notice any improvements when they’re disabled. If you find one that you use often that has these issues you may want to reach out to the developer or see if there is anything in your configuration that may be causing it to do extra checks or try and talk to devices that no longer exist.

I also took a look into the Hub events and I’m seeing a large number of events being generated by a “Belkin:device:insight:1” which is a Belkin device on your local network. I don’t know what device it is, but it’s causing a lot of network traffic and could be bottlenecking your Hub’s ability to send/receive commands from our servers or on the network. If you can identify the device you may want to try temporarily unplugging it to see if that speeds things up too. Also, these network events can be generated by devices that aren’t connected to your Hub, so keep that in mind.

These delays are likely caused by a number of sources so make sure to check any avenue you can in order to improve your system’s speed. Let me know if you have any questions about anything I said and if you’re still seeing these major delays even after trying what I suggested, please let me know and include any extra details. Let me know anything else I can do so we can make sure your system gets up and running as fast as we can make it.

I responded to each question. Rule Machine was not the cause; Smart Lighting was failing too. Actually, I mostly use Smart Lighting (it runs locally, right?). Every device was experiencing the same problems. The mobile app (iOS) is lagging too - I need to leave the device and go back in just to see updates of any kind. This is new behavior for me.

Thanks for getting back to me. Unfortunately with so many devices and automations, it may not be easy to identify a singular cause to the delay issues but I’m happy to improve the situation any way we can.

For devices that can take commands like lights and switches: Can you try manually activating one of each from the app (different bulbs/wemo/Z-Wave switches) and see if a certain type is delayed more than others?

Let me know what you see for the delay of the different types.

For your reporting devices like motion sensors, door sensors and others: Can you do the same thing, activate them (opening a door, causing motion, etc.) and let me know if there are any reporting differences between the different types of devices?

Please get back to me with those and I’ll see what we can do from there.

My apologies for the trouble, I look forward to hearing back from you so we can get things moving forward.

Between the first response and the second, everything started working normally again EXCEPT for the mobile app. I did reboot the hub and then disconnected it from power (and removed the batteries) and that did fix it.

Maybe I’m just in a bad mood, but for some reason I feel unsupported. Nothing has changed on the system - no new devices, no new Smart Apps. I do have a lot of devices (200+). I do have a lot of Smart Apps, but they are not attached to the affected devices, nor were they firing. Smart Lighting was attached and failing 100% of the time, but it’s your app, not mine. Rule Machine was not linked to the affected devices. I’m using ST’s WeMo app - if it’s too chatty, then fix it (it’s your code).

So my system is apparently so big that support can’t help:

Unfortunately with so many devices and automations, it may not be easy to identify a singular cause to the delay issues

I don’t need a singular cause, just something useful to try. I thought (and still believe) that something changed on ST’s end.

So having had things running fairly well for the past few weeks (probably longer), and feeling comfortable with how things work (DTH and SmartApps), and having changed nothing for a while (weeks), I was greatly disappointed in the response. Even though everything (EXCLUDING the mobile app lag) is working now, I am left disappointed in ST.


The sub-quote above is from your second response from Support, right? It looks to me like they really did try to walk through some extensive diagnostics with you, though with limited “metrics / instrumentation / traces” on their end to assist in getting real data.

There’s a difference between calling out a specific app (like Rule Machine or whatever) and making generalizations like “those sort of checks can cause commands to have to go through another layer of processing before they can be sent to devices”.

There is some truth to that assertion, but it depends highly on the specific SmartApp as well as the particular usage of the SmartApp. But here’s the thing: Bruce was probably correct when he said it would be difficult to get Rule Machine “published”, because, as a generalized engine, it is very difficult to QA. It is possible, for example, to write an endless looping rule in Rule Machine. SmartThings wants to avoid such complexity.
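To make the "endless looping rule" point concrete, here is a rough sketch in Python (not Groovy, and not how Rule Machine or SmartThings actually works internally). It assumes a made-up model where each rule is just "this event triggers actions that produce these other events"; the device names and the `find_rule_loop` helper are hypothetical, purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical model (an assumption, not a SmartThings API): each key is an
# event that triggers a rule, and the value lists the events that rule's
# actions go on to produce.
Rules = Dict[str, List[str]]


def find_rule_loop(rules: Rules) -> Optional[List[str]]:
    """Return one chain of events that feeds back into itself, or None."""
    path: List[str] = []  # events on the current depth-first search path
    done = set()          # events already proven not to lead to a loop

    def visit(event: str) -> Optional[List[str]]:
        if event in path:
            # The event is already on the current path, so we found a loop.
            return path[path.index(event):] + [event]
        if event in done:
            return None
        path.append(event)
        for produced in rules.get(event, []):
            loop = visit(produced)
            if loop:
                return loop
        path.pop()
        done.add(event)
        return None

    for start in rules:
        loop = visit(start)
        if loop:
            return loop
    return None


# Three made-up rules that trigger each other forever.
rules = {
    "hall_light.on": ["porch_light.on"],
    "porch_light.on": ["hall_light.off"],
    "hall_light.off": ["hall_light.on"],
}
print(find_rule_loop(rules))
```

With only three rules the loop is easy to spot by eye. With dozens of rules across 200+ devices it is not, which is roughly why a generalized rule engine is hard to QA.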

Heck … I’ve said it before: SmartThings originally envisioned SmartApps to each be limited to a specific use case with just a few minimal parameters. Part of the reason for that is to simplify review, QA, and Support.

At the moment, I think that Support is doing the best they can: Trying to isolate the problem to a set of known factors and scope. They need more metrics/traces on their end (which the Operations team should be tracking proactively as well)… I think that’s what @alex has implied they are committed to implementing.

Had it not already been answered, I would agree.

They also asked when the issue started. That was the first sentence in my email to support. Shouldn’t have been asked - already answered.

So, I respectfully disagree.

