Platform Health Update - November 2016

slagle · November 18, 2016, 6:15pm

I wanted to take a few minutes to share some very positive news about our platform health. We have reached a big milestone, and it seemed silly not to share it with you all! As we grow, we are tracking platform health very carefully with multiple monitoring tools. Two of the top areas we watch closely are virtual Hubs (vHubs) and virtual Schedules (vSchedules).

vHubs simulate how real SmartThings Hubs operate in the wild and how they interact with our cloud and SmartApps. Each vHub contains two virtual devices (Motion Sensor and SmartPower Outlet) and automations that control the devices. We then measure how long each event takes to execute and respond.

The vHub graphic above represents the time it takes for a “round trip” to happen. This means the time it takes for a device to actuate, the hub to see that action and route it to the cloud, the cloud to send that data back to you on the mobile app, and any actions triggered by that actuation to be taken.

vSchedules simulate vHubs and virtual devices that use schedule-based automations. We measure the reliability of both Cron and runIn SmartApps that run anywhere from every couple of seconds to every few minutes.

The vSchedules graphics represent the time it takes to execute a schedule at its scheduled time for runIn and Cron. (The migration downtime from last week contributed to the lower “counts” since schedules couldn’t run during the downtime.)
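As a rough illustration of the schedule measurement described above, here is a minimal sketch that compares when a schedule was due with when it actually ran. The function names, the input shape, and the 1-second threshold are assumptions made for illustration; this is not SmartThings' actual monitoring code.

```python
from datetime import datetime, timedelta

def execution_delay_ms(scheduled, executed):
    """Milliseconds between the scheduled time and the actual execution time."""
    return (executed - scheduled) // timedelta(milliseconds=1)

def summarize(runs, threshold_ms=1000):
    """runs: list of (scheduled, executed) datetime pairs (hypothetical input shape).

    Returns how many schedules ran and how many ran within the threshold.
    Schedules that never ran (for example during downtime) are simply absent,
    which is why downtime lowers the count.
    """
    delays = [execution_delay_ms(s, e) for s, e in runs]
    return {
        "count": len(delays),
        "under_threshold": sum(1 for d in delays if d < threshold_ms),
    }

due = datetime(2016, 11, 18, 18, 0, 0)
runs = [
    (due, due + timedelta(milliseconds=250)),   # ran a quarter second late
    (due, due + timedelta(milliseconds=1500)),  # ran a second and a half late
]
print(summarize(runs))
```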

Due to the amazing efforts of our cloud engineering team, our vHub latency metric for the last 30 days is reporting under 500ms in all regions. Schedules are also running with much lower latency, below 1 second most of the time (with a few exceptions). This means all of your devices, schedules, and SmartApps are operating better than ever.

RBoy · November 18, 2016, 6:29pm

I can say I do see the difference overall. There are moments when routines
fail to execute all the actions or a SmartApp times out, but great progress.
Now if just the ST mobile apps kept pace…

Dan_Fox · November 18, 2016, 6:30pm

Interesting report. I have been using ST for about 2 months now and I have noticed a much faster response in the last couple of weeks. I’ve been very satisfied so far until the Android update fiasco last night.

TheCellMC · November 18, 2016, 6:55pm

Great work. All we need now is more options in SHM (delays).

helios · November 18, 2016, 8:06pm

I don’t believe I follow… I assume the timings in ms are a 1 minute average, a 5 minute average and so on? The percentages are percentile?

Tyler · November 18, 2016, 8:09pm

Those are groups of events in the last 5 minutes, 1 hour, 1 day, 1 week, and 1 month. The graphic is demonstrating that 99% of those events are happening in under 500ms.
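To make the reading of the graphic concrete, here is a minimal sketch of a 99th-percentile latency computed over trailing time windows. The window lengths (including a 30-day month), the nearest-rank method, and all names are assumptions for illustration, not a description of how the dashboard is actually built.

```python
# Hypothetical trailing windows in seconds, mirroring the groupings named above.
WINDOWS = {
    "5 minutes": 5 * 60,
    "1 hour": 60 * 60,
    "1 day": 24 * 60 * 60,
    "1 week": 7 * 24 * 60 * 60,
    "1 month": 30 * 24 * 60 * 60,  # assumes a 30-day month
}

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it.

    pct is an integer from 1 to 100. Returns None for an empty list.
    """
    if not values:
        return None
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]

def p99_by_window(events, now):
    """events: list of (timestamp_seconds, latency_ms) pairs (hypothetical input shape)."""
    result = {}
    for name, span in WINDOWS.items():
        recent = [latency for ts, latency in events if now - span <= ts <= now]
        result[name] = percentile(recent, 99)
    return result

now = 1_000_000
events = [(now - 10, 200), (now - 4000, 5000)]  # one recent fast event, one older slow event
print(p99_by_window(events, now))
```

A window reporting 500 here would mean that 99% of the events in that window completed in 500ms or less.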

helios · November 18, 2016, 8:12pm

Ah, okay now I see.

You guys should make this publicly available.
Nice job!

femwitjava · November 18, 2016, 8:25pm

Thank You, Tim. It is great to be kept in the loop - especially when the news is Good news!
Thanksgiving = thankful for improved platform performance
= transparency between ST and user(s)
Keep going and reporting, for those deeply entrenched the communication means everything.

kjdayley · November 18, 2016, 11:58pm

Huge improvement over the last few months!!

sdjernes · November 19, 2016, 8:51am

A report like this on the support page could be useful in debugging if the global network is having problems.

Something that also shows load on links may be useful. Let me explain. Being an insomniac, I often lie in bed and read the Community or play with the IDE late at night (CST). The last few days, loading the IDE screens has been really slow. It is even slow now. So if we could see that the network is busy, with an update push or a lot of event activity from a region (Midwest, motion and vibration sensors going off from storms), some or most of us would be nice and leave parts of the system alone so the rest of it can stabilize.

As a network engineer, I have stuff babysitting my internet connection. I know when things start to go hocky here and don’t worry too much, if I can see the problem in my interface. What may be interesting is if you gathered some sort of latency data from hubs on other ISPs so that you may know if your links to say Charter or Cox are overloaded or having other problems.

For people in some areas they may make an ISP choice if they know the backbone to real world things is better. We have an up and coming ISP here (Spiral Communications out of Nebraska City, NE), who has made it a priority to have quality low latency links. They seem to have the idea that a connected home or business is important.

jpark40 · November 19, 2016, 8:56am

Can we have this type of info shown on the status.smartthings.com page?

Similar to what Vera does with their status page under System Metrics:

http://status.getvera.com/

baivab · November 19, 2016, 2:48pm

If only you guys fixed the dropping ZigBee connections to/from motion sensors (for example), it would be awesome. In the 1st chart, where you show stats for vHubs containing motion sensors, what did you do? Keep the motion sensors 1 foot away from the hub, I guess? Also put everything inside a Faraday cage maybe, so that no other interference affects it, and finally route via cable from hub to cloud? For anything else, i.e. any other setup, within a month you’d have had to reset / rejoin those motion sensors at least 10 times … that’s how bad those sensors are!!

These stats need to be taken with a grain of salt… I say that because they are hesitant to publish them officially and make them available to everyone (like Staples Connect used to do, Vera does, Iris Premier edition has, etc.). Putting them here in the forum means only 1 thing: folks are getting jumpy, so let’s give them a tailored stat to pacify them.

BTW: I would love it, if proven wrong.

slagle · November 19, 2016, 6:51pm

You’re right, we were hesitant. These numbers were not great before. But now they are, and we wanted to share that. There wasn’t any backhanded plan here. We were excited and wanted to share… Plain and simple. It’s a victory for everyone.

SBDOBRESCU · November 19, 2016, 8:14pm

There is nothing to be proved here, you are wrong, because you missed the point completely.

This is just a cloud health check and nothing more. The results are amazing compared to previous cloud performance and are worth sharing. I am sure they are well aware of the dropping devices and perhaps they are looking into straightening out that side of the business.

But solving that may not be on them. It could easily be a local networking issue, so publishing an analysis on connected devices may feel like sand in your eyes compared to your and many others’ hands-on experience.

bamarayne · November 24, 2016, 11:22pm

I can say that since the March meltdown there have been significant improvements… my system is very well behaved these days.

michaelahess · November 24, 2016, 11:30pm

I lost all (6) smart lighting routines yesterday and today. CoRE had paused pistons to step in but that’s my only real failure. Lag has gotten a lot better as well.

a7a93524 · November 25, 2016, 5:06am

I lost all scheduled events today.

m4mazzotti · November 26, 2016, 10:42am

“Cool”, and yet my basic light schedules still fail about 10% of the time.

Lee_Ross · November 26, 2016, 2:59pm

I’m having several different SL scheduled events failing.
