SmartThings Performance Monitor

Continuing the discussion from Slow response to triggers and events:

This has been a recurring theme on these forums ever since I joined 8 months ago. Every time it happens, we (the users) agonize over figuring out who’s at fault. Is it a slow internet connection? Is it radio interference? Is it a Zigbee/Z-Wave mesh network issue? Is it a SmartThings cloud problem? Or is it just our imagination?

SmartThings has a status page (http://status.smartthings.com), and sometimes it helps to check whether the ST cloud is having hiccups. But on many occasions there are no issues reported on the status page, yet some things don’t work as expected: lights don’t turn on when they’re supposed to, garage doors don’t open, etc.

So, instead of guessing, I’ve decided to take a quantitative approach and actually measure SmartThings performance. For an event-driven system such as SmartThings, the most important performance measure is event latency, i.e. how long it takes for an event to reach its destination.

Why is it important? Because everything that happens in the SmartThings cloud is communicated using events. For example, when a motion sensor detects motion, it sends an event to the SmartThings cloud. That event is passed on to some SmartApp, which then sends an “On” command to a light switch device handler. That command is also an event. Then the light switch device handler sends a command to the actual device to turn on the lights.

In order to measure event latency, I wrote a SmartApp that sends events to itself every minute and measures the time it takes to deliver each event. Note that it measures only latency inside the ST cloud and does not include your Internet connection latency or the latency in the Zigbee/Z-Wave radio network, so it gives an idea of how well the ST servers handle events. A one-minute sampling rate is not perfect, as it cannot detect very short-lived performance issues, but the app nevertheless provides a very valuable estimate of SmartThings cloud performance.
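To illustrate the idea only (this is not the SmartApp itself, which runs inside the ST cloud), here is a minimal Python sketch of self-addressed event timing. The `EventBus` class is a made-up stand-in for a cloud event queue, and `delay_s` is a simulated queueing delay; neither is a SmartThings API.

```python
import queue
import threading
import time


class EventBus:
    """Toy stand-in for a cloud event queue (hypothetical, not a SmartThings API)."""

    def __init__(self, delay_s=0.0):
        self._queue = queue.Queue()
        self._delay_s = delay_s  # simulated time an event waits before delivery
        self._handlers = []
        threading.Thread(target=self._pump, daemon=True).start()

    def subscribe(self, handler):
        self._handlers.append(handler)

    def unsubscribe(self, handler):
        self._handlers.remove(handler)

    def publish(self, payload):
        self._queue.put(payload)

    def _pump(self):
        # Deliver queued events one at a time to every subscribed handler.
        while True:
            payload = self._queue.get()
            time.sleep(self._delay_s)
            for handler in list(self._handlers):
                handler(payload)


def measure_latency_ms(bus, timeout_s=5.0):
    """Send one event to ourselves and return its delivery time in milliseconds.

    Returns None if the event is not delivered within timeout_s.
    """
    delivered = threading.Event()
    latency = []

    def on_event(sent_at):
        # The payload carries the send timestamp, so latency is now - sent_at.
        latency.append((time.monotonic() - sent_at) * 1000.0)
        delivered.set()

    bus.subscribe(on_event)
    try:
        bus.publish(time.monotonic())
        arrived = delivered.wait(timeout_s)
    finally:
        bus.unsubscribe(on_event)
    return latency[0] if arrived else None
```

Calling `measure_latency_ms(EventBus(delay_s=0.05))` once a minute and logging the result would mimic the sampling described above; a `None` result corresponds to a stalled event pump.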

The app logs latency data to a Xively feed for historical performance analysis. I’ve been monitoring SmartThings performance for a week now, and while typical event latency is surprisingly good (under 100 milliseconds), there are quite frequent spikes exceeding 500 milliseconds (half a second). Several times a day the event pump appears to stall, resulting in event latency exceeding 1 second, and sometimes 10 seconds or more.

You can see my real-time event latency feed here:
https://xively.com/feeds/1386935385

very very cool man…

This is good info @geko. Do you know what devices or services are seeing the most latency? In the developer call last night I asked if we would have access to a higher level API that would allow this sort of performance monitoring. I have 80 devices connected to SmartThings, and well over a hundred wireless devices in my house at this point. Visual tools that allow me to track down and sort out bugs are becoming quite necessary.

graph is in UTC?

I believe it’s local time (PST).

I’m seeing UTC…

The app measures the roundtrip latency of an event it sends to itself. The assumption is that all events go through a common event queue. Now, it is my understanding that SmartThings runs on an elastic cloud, so there may be several event queues, and the event queue that my app is connected to may not be the same as yours, for example.

So this is just the latency of processing commands in the cloud, and it’s already showing half-second delays and frequent stalls. Then add the average latency up and down on an ISP connection (just ping the cloud server to find out); that would be the delay from hub to cloud and back down.

Then add in the wireless issues for endpoint devices, and it’s not surprising some people are experiencing long delays and others are not.

No doubt ST has capacity issues and processing delays, and they can address those, but they can’t fix the ISP latency that a lot of US users have.

There should be a latency check on the hub to cloud and back to warn the user that their connection is slow.

I know the hub does a ping and forced ping, but it doesn’t expose the latency in the IDE. Seems like a simple way to measure ISP lag to the cloud.

There is a utility to ping the hub, but it doesn’t return the latency either; it just says when the hub was pinged.

ST cloud latency can be measured in a roundabout way using a simple SmartApp that provides a REST endpoint. Then you just send a request from your LAN and measure the response time. This will give you a good indication of the cloud latency, including your ISP latency.

Would there be some value in having various users run this concurrently in order to see if the performance issue crosses accounts?

If there is some load-balancing going on, perhaps some accounts will show much better performance than others during any one particular time slot? The results could be overlaid on a graph, and/or averaged…

Not exactly following the process…

What would be the measure points?

You need something to record the time between the two?

So log the time when the request to the REST API starts, then note the time when it is received and processed, and log the difference?

Wouldn’t this be only half the equation? It’s the speed up to the cloud, not the speed back down.

A long time ago @Ben mentioned that they were constantly monitoring round trip device latency. I wonder how that can be done effectively at the user level?

Exactly, just run a simple Python script on your local host and measure the HTTP roundtrip time. The hub probably does not use HTTP, so your measurement will be slightly higher than the roundtrip from the ST cloud to the hub and back, but it should be in the ballpark.

To / from what address or URL?

Or from within a SmartApp running in “the cloud”?

From your LAN, of course.

[Your PC] --(HTTP Request)-> [Router] --> [ISP] --> [ST Cloud] --> [SmartApp]
[Your PC] <-- [Router] <-- [ISP] <-- [ST Cloud] <-(HTTP Response)- [SmartApp]

The roundtrip time should be similar to that of ST cloud to hub, minus HTTP connection setup overhead.
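Something along these lines would do it. This is only a rough sketch using the Python standard library; the endpoint URL in the usage comment is a placeholder for whatever URL your own SmartApp exposes, and `fetch_fn` is a hook I added so the timing loop can be tried without a network.

```python
import time
import urllib.request


def fetch(url, timeout_s=10.0):
    """Send one HTTP GET and read the whole response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()


def measure_roundtrip_ms(url, samples=5, fetch_fn=fetch):
    """Return a list of HTTP roundtrip times in milliseconds.

    Each sample opens a fresh connection, so connection setup
    overhead is included in every measurement.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch_fn(url)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Example usage (placeholder URL, replace with your SmartApp endpoint):
#   timings = measure_roundtrip_ms("https://example.com/my-smartapp-endpoint")
#   print("min %.0f ms, max %.0f ms" % (min(timings), max(timings)))
```

Looking at the minimum over several samples gives a feel for the baseline, while the maximum shows the occasional stall.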

Is there a way to access the raw data as a download?

Just thinking of doing various spreadsheet analysis, such as counting any period with values > 1 second as “down”, and using that to calculate “% downtime”; charting performance vs. time-of-day, etc., etc…
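The same analysis could also be done in a few lines of Python instead of a spreadsheet. A rough sketch, assuming the data has already been loaded as (timestamp, latency in ms) pairs; the function names and the 1000 ms default threshold are just my choices for illustration:

```python
from collections import defaultdict
from datetime import datetime


def percent_above(latencies_ms, threshold_ms=1000.0):
    """Percentage of samples whose latency exceeds the threshold ("% downtime")."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for value in latencies_ms if value > threshold_ms)
    return 100.0 * slow / len(latencies_ms)


def mean_by_hour(samples):
    """Average latency per hour of day from (datetime, latency_ms) pairs."""
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[timestamp.hour].append(latency_ms)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}
```

With one sample per minute, each sample above the threshold counts as one "down" minute.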

Thanks!

Yes, the feed is public. You can read it using Xively REST API, for example:
https://api.xively.com/v2/feeds/1386935385

Thanks, but … dang… asking for a login / password; and they are backlogged on offering Developer accounts.

Oh… I wasn’t aware of that. I created a read-only key for that channel. You can try it to see if it works.

CAVH3JQXE5RbTsjEfGIobZp1huamUYlgwHAbIskQkTZyY9Nw

https://xively.com/dev/docs/api/quick_reference/historical_data/
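If you’d rather script it, here is a rough Python sketch of reading the feed with the key. It assumes the key is passed in an `X-ApiKey` request header and that the feed URL answers with JSON; please check both against the Xively docs linked above, since I have not tested this exact script.

```python
import json
import urllib.request

FEED_URL = "https://api.xively.com/v2/feeds/1386935385"


def build_feed_request(feed_url, api_key):
    """Build a GET request carrying the API key (X-ApiKey header is an assumption)."""
    return urllib.request.Request(
        feed_url,
        headers={"X-ApiKey": api_key, "Accept": "application/json"},
    )


def fetch_feed(feed_url, api_key, timeout_s=10.0):
    """Fetch the feed and decode it, assuming the response body is JSON."""
    request = build_feed_request(feed_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage with the read-only key posted above:
#   feed = fetch_feed(FEED_URL, "<read-only key>")
#   print(sorted(feed.keys()))
```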

You’re right. When I’m logged out from my dev account, I get UTC as well.