Platform Health Update - November 2016

I wanted to take a few minutes to share some very positive news about our platform health. We have reached a big milestone, and it seemed silly not to share it with you all! As we grow, we are tracking platform health very carefully using multiple tools. Two of the top areas we watch closely are virtual Hubs (vHubs) and virtual Schedules (vSchedules).


vHubs simulate how real SmartThings Hubs operate in the wild and how they interact with our cloud and SmartApps. Each vHub contains two virtual devices (Motion Sensor and SmartPower Outlet) and automations that control the devices. We then measure how long each event takes to execute and respond.

The vHub graphic above represents the time it takes for a “round trip” to happen. This means the time it takes for a device to actuate, the hub to see that action and route it to the cloud, the cloud to send that data back to you on the mobile app, and any actions triggered by that actuation to be taken.


vSchedules simulate vHubs and virtual devices that use schedule-based automations. We measure the reliability of both Cron and runIn SmartApps that run anywhere from every couple of seconds to every few minutes.

The vSchedules graphics represent the time it takes to execute a schedule at its scheduled time for runIn and Cron. (The migration downtime from last week contributed to the lower “counts” since schedules couldn’t run during the downtime.)
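
As a rough illustration of what a schedule measurement like this could look like, here is a minimal Python sketch. It assumes each run is recorded as a pair of scheduled time and actual execution time; the function names and the 1000 ms threshold in the usage note are hypothetical and are not taken from the SmartThings monitoring tools.

```python
from datetime import datetime, timedelta


def schedule_delays_ms(runs):
    """Return the delay in milliseconds for each (scheduled, actual) datetime pair."""
    return [(actual - scheduled).total_seconds() * 1000 for scheduled, actual in runs]


def share_under(delays_ms, threshold_ms):
    """Return the fraction of delays strictly below threshold_ms, or None if there are no runs."""
    if not delays_ms:
        return None
    return sum(1 for d in delays_ms if d < threshold_ms) / len(delays_ms)


# Made-up sample data: one run fired 250 ms late, the other 1500 ms late.
base = datetime(2016, 11, 22, 12, 0, 0)
sample_runs = [
    (base, base + timedelta(milliseconds=250)),
    (base, base + timedelta(milliseconds=1500)),
]
sample_delays = schedule_delays_ms(sample_runs)
```

Calling `share_under(sample_delays, 1000)` on the sample data gives the fraction of runs that fired less than one second after their scheduled time.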


Thanks to the amazing efforts of our cloud engineering team, our metric for tracking vHub latency over the last 30 days is reporting under 500ms in all regions. Schedules are also running with much lower latency, below 1 second most of the time (with a few exceptions). This means all of your devices, schedules, and SmartApps are operating better than ever.

38 Likes

I can say I do see the difference overall. There are moments when routines
fail to execute all the actions or a SmartApp times out, but great progress.
Now if just the ST mobile apps kept pace…

6 Likes

Interesting report. I have been using ST for about 2 months now and I have noticed a much faster response in the last couple of weeks. I’ve been very satisfied so far until the Android update fiasco last night.

Great work. All we need now is more options in SHM (delays).

2 Likes

I don’t believe I follow… I assume the timings in ms are a 1 minute average, a 5 minute average, and so on? The percentages are percentiles?

Those are groups of events in the last 5 minutes, 1 hour, 1 day, 1 week, and 1 month. The graphic is demonstrating that 99% of those events are happening in under 500ms.
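
For anyone who wants to see the idea concretely, here is a minimal Python sketch of a 99th percentile computed per time window. It assumes events are stored as (timestamp, latency in ms) pairs, uses the nearest-rank percentile definition, and treats a month as 30 days; all names are hypothetical and this is not how the actual dashboard is implemented.

```python
import math
from datetime import datetime, timedelta

# Hypothetical window sizes matching the groups named above.
WINDOWS = {
    "5 minutes": timedelta(minutes=5),
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
    "1 month": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_window(events, now):
    """events: list of (timestamp, latency_ms). Returns {window name: p99 latency or None}."""
    result = {}
    for name, span in WINDOWS.items():
        recent = [ms for ts, ms in events if now - span <= ts <= now]
        result[name] = percentile(recent, 99)
    return result
```

A window "passes" in the sense described above when its value in the returned dictionary is under 500.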

1 Like

Ah, okay now I see.

You guys should make this publicly available :slight_smile:
Nice job!

1 Like

Thank You, Tim. It is great to be kept in the loop - especially when the news is Good news!
Thanksgiving = thankful for improved platform performance :slight_smile:
= transparency between ST and user(s)
Keep going and reporting, for those deeply entrenched the communication means everything.

1 Like

Huge improvement over the last few months!!

1 Like

A report like this on the support page could be useful in debugging if the global network is having problems.

Something that also shows load on links may be useful. I will explain this more. Being an insomniac, I often lie in bed and read the Community / play with the IDE late at night (CST). The last few days, loading of the IDE screens has been really slow. It is even slow now. So if the network is busy with an update push or a lot of event action from a region (Midwest, motion and vibration sensors going off from storms), some or most of us will be nice and leave parts of the system alone so the rest of it can stabilize.

As a network engineer, I have stuff babysitting my internet connection. I know when things start to go hocky here and don’t worry too much, if I can see the problem in my interface. What may be interesting is if you gathered some sort of latency data from hubs on other ISPs so that you may know if your links to say Charter or Cox are overloaded or having other problems.

For people in some areas they may make an ISP choice if they know the backbone to real world things is better. We have an up and coming ISP here (Spiral Communications out of Nebraska City, NE), who has made it a priority to have quality low latency links. They seem to have the idea that a connected home or business is important.

3 Likes

Can we have this type of info shown on the status.smartthings.com page?

Similar to what Vera does with their status page under System Metrics:

http://status.getvera.com/

2 Likes

If only you guys fixed the dropping ZigBee connections to/from motion sensors (for example) it would be awesome. In the first chart, where you show stats for vHubs containing motion sensors, what did you do? Keep the motion sensors 1 foot away from the hub, I guess? Also put everything inside a Faraday cage maybe, so that no other interference affects it, and finally routed via cable from hub to cloud? For anything else, i.e. any other setup, within a month you’d have had to reset / rejoin those motion sensors at least 10 times … that’s how bad those sensors are!! :slight_smile:

These stats need to be taken with a grain of salt… I say that because they are hesitant to publish them officially and make them available for everyone (like Staples Connect used to do, Vera does, Iris Premier edition has, etc.). Putting them here in the forum means only 1 thing: folks are getting jumpy, so let’s give them a tailored stat to pacify them.

BTW: I would love it, if proven wrong.

You’re right, we were hesitant. These numbers were not great before. But now they are, and we wanted to share that. There wasn’t any backhanded plan here. We were excited and wanted to share… Plain and simple. It’s a victory for everyone.

12 Likes

There is nothing to be proved here, you are wrong, because you missed the point completely.

This is just a cloud health check and nothing more. The results are amazing compared to previous cloud performance, and that is worth sharing. I am sure they are well aware of the dropping devices and perhaps they are looking into straightening that side of the business.

But solving that may not be on them. It could easily be a local networking issue, so publishing an analysis on connected devices may feel like sand in your eyes compared to your and many others’ hands-on experience.

6 Likes

I can say that since the March meltdown there have been significant improvements… my system is very well behaved these days.

4 Likes

I lost all (6) smart lighting routines yesterday and today. CoRE had paused pistons to step in but that’s my only real failure. Lag has gotten a lot better as well.

I lost all scheduled events today.

“Cool”, and yet my basic light schedules still fail about 10% of the time.

I’m having several different SL scheduled events failing.