Outage Affecting STHM and some automations (19 Jan 2021)

I can tell you what I had in mind when I suggested in another thread that I could see a failure mechanism that Virtual Switches have that Simulated Switches wouldn’t have. It’ll be a bit wordy as I want to cater for a broad audience.

Legacy Groovy device handlers (which the stock Virtual Switches and Simulated Switches are) set device attributes by sending an event. These events then get propagated to apps that have subscribed to them. The apps can choose to subscribe to certain values, e.g. on or off in the case of switches, but it is arguably more common to take what comes. Apps are often particularly interested in changes of state, so for a switch they want to know about changes from on to off or off to on. The legacy SmartThings platform agrees and by default only propagates the state changes, so apps only see on, off, on, off etc. That is how the Simulated Switch works.

Sometimes it is useful for apps to see every event regardless of whether it is a change or not. For example, you might have a button used to turn a switch on and you might want to see every press of it. So you want to see on, on, on etc. This can be achieved by setting the isStateChange: true flag when sending events, which means the event should be treated as a change of state regardless of whether the value has changed. Apps can then see off, on, on, off, on, off, off, on etc. For whatever reason, this is how the Virtual Switch was written.
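
As a rough illustration of the difference between the two handlers, here is a toy Python model. It is not SmartThings code: the Switch class, force_state_change and record are names made up for the sketch, with force_state_change standing in for isStateChange: true, and the real platform will certainly differ in the details.

```python
class Switch:
    """Toy model of a switch device handler (hypothetical, not SmartThings code)."""

    def __init__(self, force_state_change=False):
        self.value = "off"
        # Stand-in for sending every event with isStateChange: true.
        self.force_state_change = force_state_change
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def send_event(self, value):
        changed = value != self.value
        self.value = value
        # By default only changes of state are propagated to subscribers;
        # the flag makes every event count as a change.
        if changed or self.force_state_change:
            for callback in self.subscribers:
                callback(value)


def record(switch, commands):
    """Send the commands to the switch and return what a subscribed app sees."""
    seen = []
    switch.subscribe(seen.append)
    for command in commands:
        switch.send_event(command)
    return seen


commands = ["on", "on", "off", "off", "on"]
simulated_seen = record(Switch(force_state_change=False), commands)
virtual_seen = record(Switch(force_state_change=True), commands)

print(simulated_seen)  # ['on', 'off', 'on']
print(virtual_seen)    # ['on', 'on', 'off', 'off', 'on']
```

The Simulated Switch style swallows the repeated on and off, while the Virtual Switch style passes everything through.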

Now let us consider the particular example of using six automations and three virtual switches to expose the STHM status (Security Mode) to third party apps. It basically distills down to three pairs of Automations that say, for example:

if
  Security Mode is disarmed
then
  turn Disarmed switch on
  turn Armed (Stay) switch off
  turn Armed (Away) switch off

and

if
  Disarmed switch is on
then
  Set Security Mode to disarmed

Bearing in mind that automations (however implemented) typically get triggered by changes of attribute state, what you have there is a potential infinite loop. Turning on a switch changes the mode, which turns on a switch, which changes the mode etc.

The user is entirely dependent on SmartThings stopping that infinite loop happening and we know it can do it as it has been doing so for a long time. So how does it do it?

Well, if the user chose to use the Simulated Switch, or another custom handler that behaves the same way, they are in luck: if, for example, you set the Disarmed switch to on when it was already on, the event isn’t going to be propagated and so it won’t start another Automation.

However, lots of people use the Virtual Switch handler and that will happily pass on continuous on events. So now we are relying on SmartThings helping us out in other ways. For example, maybe you can’t repeatedly set the Security Mode to disarmed. Well, anyone who has seen repeated notifications from the app knows that you can, so that can’t be it.

So what is it? Well to be honest I don’t know for sure. However the Automations aren’t legacy apps, and new integrations are working with a different API. I seem to remember seeing the state change flag on subscriptions. So maybe they can choose to only subscribe to state changes. That would explain why things worked OK.
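
To make the possible mechanism concrete, here is a toy Python simulation of one pair of Automations. Everything in it is an assumption for illustration: the run function, its two flags and the event cap are invented, it assumes that setting the Security Mode always raises an event (in line with the repeated notifications), and it says nothing about how SmartThings actually implements subscriptions.

```python
from collections import deque


def run(switch_forces_change, rule_sees_all_events, max_events=20):
    """Count the events the rules act on before things settle or the cap is hit."""
    state = {"mode": "armed", "switch": "off"}
    queue = deque()
    handled = 0

    def emit(attribute, value, forced):
        changed = state[attribute] != value
        state[attribute] = value
        # Device side: propagate real changes, or everything if forced.
        if changed or forced:
            queue.append((attribute, value, changed))

    # The user disarms from the app, which starts the chain.
    emit("mode", "disarmed", forced=True)

    while queue and handled < max_events:
        attribute, value, changed = queue.popleft()
        # Subscription side: a rule may ignore events that are not real changes.
        if not changed and not rule_sees_all_events:
            continue
        handled += 1
        if attribute == "mode" and value == "disarmed":
            # "if Security Mode is disarmed then turn Disarmed switch on"
            emit("switch", "on", forced=switch_forces_change)
        elif attribute == "switch" and value == "on":
            # "if Disarmed switch is on then set Security Mode to disarmed"
            # Assumption: setting the mode always raises an event.
            emit("mode", "disarmed", forced=True)
    return handled


print(run(switch_forces_change=False, rule_sees_all_events=True))   # 3
print(run(switch_forces_change=True, rule_sees_all_events=False))   # 2
print(run(switch_forces_change=True, rule_sees_all_events=True))    # 20
```

In this model the Simulated Switch style settles on its own, the Virtual Switch style settles only if the rules subscribe to state changes alone, and the Virtual Switch style with rules that see every event runs until it hits the cap, which is the infinite loop.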

So what has changed? Well, apparently Automations and Scenes have been reengineered as front ends for rules in the Rules API. So is the new implementation of Automations also only subscribing to state changes, or is it seeing everything? I don’t know, but if it has changed it would explain some apparent infinite looping issues with Virtual Switches.

Update: I had it in my head that the Automations and Scenes would still be entities in their own right that were effectively a front end for the Rules API. Things make more sense if they ARE rules and what we see in the apps is derived back from the rules. Indeed ST have already pretty much said that is what they are; I just didn’t completely grasp it.

All I know for sure is that a mechanism like I describe above could explain certain reported issues. It could be correct, it could be broadly along the right lines, or it could be complete nonsense. It was where I was coming from though.
