Why things fail


(Brian) #1

I think this one screenshot summarizes a lot of people’s problems: too many slow device actions in one routine/automation/app. Note how slow my Honeywell DTH is to respond. Remember that ST halts any app that has been executing for more than 15 seconds. Blammo.
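A rough sketch of the arithmetic behind this claim, written in Python rather than SmartThings Groovy. It assumes device commands run one after another inside the app and that the platform stops the app once the cumulative time passes the 15-second limit quoted above. The per-command durations are made up for illustration and do not come from any real log.

```python
EXECUTION_LIMIT_S = 15.0  # the limit quoted in the post; treated here as an assumption


def commands_completed(durations, limit=EXECUTION_LIMIT_S):
    """Count how many sequential commands finish before the app is halted."""
    elapsed = 0.0
    completed = 0
    for duration in durations:
        elapsed += duration
        if elapsed > limit:
            break  # the platform would halt the app here
        completed += 1
    return completed


# Hypothetical timings in seconds: ten quick commands versus a routine
# that also drives four slow thermostat-style commands.
fast_routine = [0.2] * 10
slow_routine = [0.2] * 6 + [4.0] * 4

fast_done = commands_completed(fast_routine)
slow_done = commands_completed(slow_routine)
print(fast_done, "of", len(fast_routine))
print(slow_done, "of", len(slow_routine))
```

Under these assumed numbers the slow routine is cut off before its last command, even though no single command is anywhere near the limit on its own.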

It also demonstrates how awesome CoRE is.


(Brian) #2

And a relevant post I once made. Still accurate except I use CoRE now, and the Android app is much improved, but still needs work. I’m also well over sixty devices now including an Echo.


(Brian) #3

And lastly: ST, is there a way to help users avoid this trap without arbitrarily limiting them?

Ideas

  1. Kill calls by device, not the app. So if a call to a device goes more than 10 seconds, kill that call, not the app.
  2. Measure processing time, not wait times. Calls to device and waiting don’t count towards app run time.
  3. Similar to 2., measure CPU time not run time.
  4. Sloppy, might not recommend: Increase app run time when platform is performing, restrict when slow. This would still affect people, but I think less.
  5. Don’t love this idea: Limit users to a high number of devices (say 10) so they at least get an inkling that more is bad.
  6. Use methods like @ady624’s to encapsulate device calls.
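Idea 1 can be sketched in Python (not SmartThings Groovy, and not how the platform actually works). The helper `run_with_per_call_timeout` and the device names are hypothetical. One caveat: a Python thread cannot be killed, so this sketch only stops waiting for the slow call and moves on, while the slow call keeps running in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout


def run_with_per_call_timeout(calls, timeout_s):
    """Run each (name, function) pair in order, giving up on a call, not the whole run."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(calls)))
    try:
        for name, fn in calls:
            future = pool.submit(fn)
            try:
                results[name] = future.result(timeout=timeout_s)
            except CallTimeout:
                # Stop waiting for this device only; later calls still run.
                results[name] = "timed out"
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned call
    return results


def slow_thermostat():
    time.sleep(0.5)  # stands in for an unresponsive device
    return "heat"


calls = [
    ("lamp", lambda: "on"),
    ("thermostat", slow_thermostat),
    ("lock", lambda: "locked"),
]
results = run_with_per_call_timeout(calls, timeout_s=0.1)
print(results)
```

The lock command still goes out even though the thermostat never answered in time, which is the behavior the first idea asks for.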

Just had some time this evening to think about these things. @slagle @jody.albritton if they help…


(Tim Slagle) #4

We are working on a way to make HTTP requests faster. This will solve a lot of the problems like you are describing.

EDIT: Well, not making the requests themselves faster, but making the time between sending a request and receiving the data not matter.


(Brian) #5

Solves one aspect of the problem. However, I still theorize that a non-responsive device of any sort will take out an app and anything it was trying to accomplish. It’s deeper than my single example, but I appreciate the planned improvement all the same.


(Never Trust @bamarayne) #6

I will add: where there is no technical solution, document the problem and provide a best practices guide to the community.

The conclusions drawn from trial and error are one thing, but confirming with the back end engineers and providing solid guidance to the community would be tremendously helpful.

If one were to follow such methods described by a community member, it’s entirely possible one could be making the problem worse rather than better.


(Geko) #7

Nonsense. All commands in SmartThings are asynchronous, or at least they’re supposed to be. The DH or the app does not wait for the command to complete. The “execution” time you’re seeing in the log has nothing to do with how “slow” the device is, but everything to do with how slow the server is.

Read the docs:

When dealing with the physical graph there will always be a delay between when you request something to happen and when it actually happens. There is latency in all networks, but it’s especially pronounced when dealing with the physical graph.

To deal with this, the SmartApps platform utilizes asynchronous execution. This means that anytime you execute a command, it doesn’t stop everything else from running. This helps everyone’s code run the most efficiently.

http://docs.smartthings.com/en/latest/architecture/index.html#important-concepts
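A minimal Python sketch of what asynchronous, fire-and-forget execution means in general. It is not SmartThings code, and every name in it (`send_command`, `delivery_worker`, `network_ready`) is hypothetical. The event stands in for network and device latency, so the recorded order shows the app finishing before anything is delivered.

```python
import queue
import threading

events = []                        # ordered record of what happened
outbox = queue.Queue()             # commands waiting for delivery
network_ready = threading.Event()  # stands in for network/device latency
STOP = object()                    # sentinel that tells the worker to exit


def send_command(device, command):
    """Fire and forget: queue the command and return without waiting."""
    outbox.put((device, command))


def delivery_worker():
    """Deliver queued commands once the simulated network is ready."""
    network_ready.wait()
    while True:
        item = outbox.get()
        if item is STOP:
            return
        events.append("delivered %s.%s" % item)


worker = threading.Thread(target=delivery_worker)
worker.start()

# The "app": both sends return immediately, however slow the devices are.
send_command("lamp", "on")
send_command("thermostat", "setHeatingSetpoint")
events.append("app finished")

network_ready.set()   # only now does the simulated network catch up
outbox.put(STOP)
worker.join()
print(events)
```

In this model the app's run time does not depend on how long delivery takes, which is the point the quoted documentation makes.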


(Ben W) #8

How many synchronous threads can ST run? You are often limited by the number of threads the processor can handle.

The thread should stay alive until it gets a 200 response (or the equivalent in Z-Wave), unless it’s a fire and forget, which has its own problems.

It could also be that a SmartApp runs in a single thread, but you can have multiple threads running different SmartApps or Device Handlers.


(Brian) #9

It’s a theory. No problem if I’m wrong… I have also read that it is fire and forget.

However it might be hard to argue with my results. I have very little failure.

Maybe it still requires some acknowledgement of the fire command that the platform itself can’t supply quick enough. So the physical device isn’t limiting, but the acks are piling up.

Anyways, results.


(Bobby) #10

There is merit in keeping it simple, no doubt. I’ve had a similar experience following a similar approach; however, that doesn’t mean we weren’t just lucky. There are reports of routines that change modes failing regularly. There is no DH involved in those. So, yes, I dropped all but Hue c2c and have had success running routines, but there are other things under the hood that go wrong from time to time.


(Bobby) #11

Hey Brian, out of curiosity, how many devices do you have?


(Brian) #12

85 things in my list. Probably 5-7 of those are virtual. 10-15 cloud to cloud. The rest a mix of ZigBee and Zwave.


(Fast, Good, Cheap...pick two.) #13

I know there are different experiences based off all kinds of variables. The fact that my setup has run flawlessly at times tells me that it isn’t my complication.

…granted I have (many times over) screwed up my logic unknowingly and blamed ST…only to figure out I screwed up.

I just laugh it off most times as it appears to emulate life…good times / bad times.

I don’t think there is a magic setup…only good days and bad


(Aaron S) #14

Our goal is to get all good days (or at least five 9s worth).

If you have SmartApps that are not executing, shoot a note over to support@ detailing the approximate time/date of the failure, what was expected to happen, and if available - screenshots, etc. If you DM me the ticket number, I will flag it for deeper investigation and ensure the logs are pulled.


( I hate Mondays) #15

I believe that ST is a mix of sync vs fire-and-forget. Commands definitely take time to execute (talking about device.command() in Groovy). Some consistently more than others. I assume (notice the word assume?) that the command is synchronous to the point where the DTH code is spun up and executed in the cloud. Then the ST app resumes. Meanwhile, the command that is returned during the DTH command execution is sent over to the hub which then sends it to the relevant network. At that point, that’s a fire-and-forget, from the SmartApp’s point of view. The SmartApp may run other commands while the hub is receiving commands to be executed in the mesh network. At least that’s what I see from a programmer perspective.
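The model assumed in this post can be sketched in Python. It only illustrates the assumption and is not confirmed platform behavior. The function `run_app` and all timings are hypothetical: each command costs the app the time the DTH code takes to run in the cloud, while delivery over the hub and mesh is queued and never counted against the app.

```python
def run_app(commands):
    """Simulate one app run under the assumed sync-then-forget model.

    commands: list of (device, handler_ms, radio_ms) tuples, where
    handler_ms is the synchronous DTH execution time and radio_ms is the
    later hub/mesh delivery time. Both are made-up numbers.
    Returns (app_time_ms, hub_queue).
    """
    app_time_ms = 0
    hub_queue = []
    for device, handler_ms, radio_ms in commands:
        app_time_ms += handler_ms              # synchronous: the app waits for the DTH code
        hub_queue.append((device, radio_ms))   # fire and forget: the app does not wait
    return app_time_ms, hub_queue


commands = [
    ("lamp", 100, 2000),        # quick handler, slow radio
    ("thermostat", 3000, 500),  # handler that needs wind-up, quick radio
]
app_time_ms, hub_queue = run_app(commands)
print(app_time_ms, hub_queue)
```

If this model holds, a device with a slow handler eats into the app's time budget, while a device that is merely slow on the radio side does not.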


(Brian) #16

If your assumption is correct, and I think it is, this is why having too many actions in anything but CoRE can hurt. Just issuing the ten synchronous device commands can take too much time. I’ve watched my Honeywells in my logs (I have a custom DTH I run/maintain); the code requires some wind-up to execute.

BTW, this isn’t easy for ST to solve, not everything can be asynchronous. It gets complicated fast! But it is their job, so good luck to them and hope our discussing it helps in some small way.