SmartThings Outage - Jan 03 2018

No problem, there have been a lot of conversations about what runs without the cloud this week, not surprisingly.

I thought about writing a post on it but the fact is it’s so complicated and depends so much on the exact setup with both devices and automations that I couldn’t even get started on it. :scream: There are so many different possible configurations which then end up with different results as far as the various points of vulnerability: the hub itself, the Internet, the SmartThings cloud. So another thread would probably be a good idea, it’s going to be a complicated discussion. :wink:

Oh I was onboard when that happened… I ended up having to send my hub TO them for them to un-brick it. The turnaround on it was very fast though. That was the only major hiccup I ever had with Wink other than my GE bulbs being a pain in the butt and losing their connection every month or so.

I still love the flexibility the ST system offers, but man it’s frustrating after a long day of work…coming home and having my lights doing all sorts of weird shit or not working at all.

You guys don’t know how good you got it. The only thing available in our area is 10 meg DSL. And with that we’re lucky if we get 8 meg. I guess there is satellite, but too pricey for my budget.

:grinning: There are pros and cons to everything. I’ll bet you are only paying around $29.00 for your internet service versus $160 per month. Who is your provider? Something like Charter?

Ya, satellite you are going to pay a hell of a lot more and you aren’t going to get that much more speed. For example, if you went the Hughes route and their 50gig plan, you are going to pay $100 to $125 for a Max speed of 25mb download with a 50gig data max.

We are on Centurylink. We pay $45/month for 10 meg service, which is the fastest available in this area. There is no cable or other provider.

I did check satellite and it would run $150/month, 2 year contract, 20 meg speed. Unlimited data download (or so they say).

Since we do a lot of streaming (we don’t have satellite TV anymore), I use about 90 gig a month.

The biggest problem we have is the DSL slows down a lot at night. It’s not the modem connect speed but the actual internet speed. And that makes it difficult sometimes to watch any TV. We do have an OTA antenna which accounts for a lot of our TV so it’s not the end of the world. Of course Centurylink only cares about the modem connect speed so they don’t seem inclined to do anything about it.

But that’s the price you pay for living in a rural area. We are in southwest Missouri, a few miles outside of Branson. Little town of Reeds Spring. Great place to live, just not high on the ISP’s list of places to service.

I’ve added a how-to on planning for outages which might be of interest to those following this thread.

It doesn’t give specific solutions because there are just too many variables depending on the exact devices and automations being used. But it does describe the different types of outages, some important use cases you may want to think about, and encourages people to set up their own discussion thread under projects if they want help brainstorming their own “Plan B.”

Hmm, I wonder if this is why I’m getting

“this device is unavailable”

On all my sensors. In UK.

No activity showing from last 7 days in the app either.

Okay, time to jump back to this drama… Good folks…Is the general consensus this outage is fixed? Or still sporadic issues? I am seeing “this device is unavailable” left and right…

I am also attempting to pair new Z-Wave sensors, same BS as before… Delay, Delay, Delay and out-of-sync status in the app vs. physical device.

My system started going haywire again 2 hours ago. Is there a new outage thread or will this thread continue as one big outage?

This blows; I may need to revisit OpenHAB. :frowning:

Pounding on the Z-Wave repair button 5 times in the IDE seems to have resolved some of my issues.

Except for my Aeon Siren, which insists on being “Unavailable”: when I trigger a test siren it works, yet the app still is stupid and shows it as unavailable. Still seeing weird crap. It would be nice if the ST staff chimed in with an update if one hasn’t been provided already…

I’m chalking this up to rushed code deployment… Likely management-driven expectations to hit dates over ensuring quality. I don’t think this is a product of the ST dev team. Most problems like this are created as a result of higher level strategic stupidity, which drives actions that ultimately cause problems… EXAMPLE: “Oh, gotta hit that date so management doesn’t get pissed, so…let’s roll spaghetti style - throw the Sh|t out there and see if it sticks”. LOL. Like it or not, we ARE the test bed…

Compared to other IoT platforms, of which I have at least a dozen, most of which use Amazon Web Services (AWS), this is the only one I’ve seen problems with. So the thought that this is related to “Meltdown” and “Spectre” is timely and plausible, but if it were we’d be seeing quite a bit more sh|t going down. Pretty certain this is squarely in the ST camp, thus the radio silence…

Hey ST Staff - Most of us understand this is complicated stuff and sh|t breaks. Just tell us someone F’ed up, and how if possible, and I think your user base will be more forgiving… X happened. Y is what we’re doing to resolve it and prevent it in the future. Z - the end… It’s that simple. If you’re adopting (by force or choice) the Samsung mentality of “Say nothing until Korea approves”, then both of us are in for a long haul of user distrust. :wink:

I think this is an AWS issue. My Ring doorbell is having issues too. Same regions in AWS.

I am having significant lag. Tried doing the Z-Wave repair. Showing 4 controllers unhealthy. Not sure how to determine which. This outage has been a pain I must admit.

If the repair was done via https://account.smartthings.com/login “My Hubs” -> “View Utilities” -> “Repair Z-Wave Network”, a link appears afterwards that you can click to view the status. That should tell you the specific devices with issues.

Running the repair multiple times cleared the problem devices. It can take a while for details to show up in the IDE. Try not to navigate off the page or multi-task, as many browsers (e.g. Chrome) nowadays will put a tab to sleep if it’s in the background, so you’ll miss the log output.

I have Ring as well. Haven’t noticed any issues, but now I’m intrigued and will test. :slight_smile:

AWS Status Page Here ——> https://status.aws.amazon.com/

I also noticed Ring recording and notification failures around the same time SmartThings was failing. Currently, Ring seems OK; SmartThings seems to fluctuate, judging by the responses from the Xfinity Keypad and Lannouncer/Big Talker speech.

I’ve noticed if I go into the /dev interface and select my hub and then List Events, I’m seeing hubStatus zb_radio_off events right around the time a lot of my devices become unavailable. I’m pretty sure it’s my zigbee devices (e.g. Cree bulbs). About a minute after the zb_radio_off event I get a zb_radio_on event. I suspect that when the radio is cycled it takes a while for all the zigbee devices to re-connect.

I have no idea why the zb_radio_off events are occurring, but I got a few of them early this morning (like around 5-6 AM, EST) and the one in the log capture below early this afternoon. I’m not doing anything to my hubs like resetting them or power cycling them. Dunno if SmartThings is able to issue commands from the cloud that are cycling my zb radio. Wondering if this could be a hack/virus?

Event Log screenshot, note the zb_radio_off event at 1:49:43 PM:
Screenshot-2018-1-7 Events List
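
For anyone who wants to check this pattern in their own hub's event list, here is a minimal Python sketch of the pairing described above: it matches each zb_radio_off with the next zb_radio_on and reports how long the radio was down. The `radio_gaps` helper and the `(timestamp, name)` tuple format are assumptions for illustration, not anything the IDE exports. The sample off time is taken from the screenshot, while the date (from the screenshot's file name) and the on time ("about a minute" later) are assumed.

```python
from datetime import datetime

def radio_gaps(events):
    """Pair each zb_radio_off with the next zb_radio_on.

    `events` is a list of (timestamp, event_name) tuples (hypothetical format).
    Returns a list of (off_time, seconds_down) tuples.
    """
    gaps = []
    off_at = None
    for ts, name in sorted(events):
        if name == "zb_radio_off":
            off_at = ts
        elif name == "zb_radio_on" and off_at is not None:
            gaps.append((off_at, (ts - off_at).total_seconds()))
            off_at = None
    return gaps

# Sample data: the off time is from the screenshot; the date and on time are assumed.
events = [
    (datetime(2018, 1, 7, 13, 49, 43), "zb_radio_off"),
    (datetime(2018, 1, 7, 13, 50, 43), "zb_radio_on"),
]

for off_at, seconds_down in radio_gaps(events):
    print(f"radio off at {off_at:%H:%M:%S} for {seconds_down:.0f} s")
```

Lining up those off times against the moments devices went unavailable would at least show whether the two really coincide.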

$160 / month! I pay $50 for 250MBit. You need to call Comcast and ask for a better deal.

Hehe, no I don’t. My ISP plan is for business (24x7), not just residential, with 10 additional email addresses and other options. Downtime? Don’t know what that is with my ISP. Knock knock knock… :sunglasses:

So I have been live on my hub since Black Friday. This outage lingered on longer than I expected. Things seem to be normal for me as of this morning. This is my first outage.

So how frequent are these outages?

I have been using SmartThings for a little over a year and this is the most significant / widespread / longest outage that I have been witness to. Long-timers with 3 to 5+ years on SmartThings can tell you about outages of the past that top this little one by miles. Outside of the outage there have been a bunch of smaller-scale things, such as the UK having problems with Modes and Routines ignoring those Modes for about 3 weeks. There have been some functional elements that have been removed or broken that to some are pretty significant. Overall my ST experience has been about 90% positive, then again I treat this as more of a hobby type project and don’t allow my home to be 100% reliant on SmartThings. Here’s a topic that might be useful to you to plan for the future:

I have a feeling that this wasn’t a complete outage from start to finish. Day 1 appeared to only affect users on the na02 shard (myself included). Day 2: na02 users for the most part were restored (I never had an issue after the first 24 hours) and na04 users were affected instead. Day 3 seemed to bring more sporadic issues for various users, and maybe another shard. It almost looks like each day hit a different set of users or shard / URL. That’s just a guess based on what was posted. If every single person were to post what country and shard they are on, it would be easier for us community members to find some rhyme or reason as to when, who and what is affected when we haven’t heard anything back directly from ST.
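
If people did post their country and shard, the tally could be as simple as the Python sketch below. Everything in it is hypothetical: the report format, the `affected_shards_by_day` helper, and the sample rows, which only mirror my guess above (the countries are placeholders) and are not real data.

```python
from collections import defaultdict

# Hypothetical reports: (day of outage, country, shard, affected?)
reports = [
    (1, "US", "na02", True),
    (2, "US", "na02", False),
    (2, "US", "na04", True),
]

def affected_shards_by_day(reports):
    """Group the (country, shard) pairs that reported problems by outage day."""
    by_day = defaultdict(set)
    for day, country, shard, affected in reports:
        if affected:
            by_day[day].add((country, shard))
    return {day: sorted(shards) for day, shards in sorted(by_day.items())}

for day, shards in affected_shards_by_day(reports).items():
    print(f"Day {day}: {shards}")
```

With enough real reports, a shard that moves from "affected" to "not affected" between days would show up right away.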

There were 7 planned outages in 2017, but some of them had problems and there had to be multiple fixes deployed which resulted in more outages over a couple of days.

There were at least 12 unplanned outages as well.

Sometimes the outages only affect one region, sometimes they affect multiple. Sometimes they only last about 15 minutes, but sometimes several days or, in the case of device-specific problems, weeks.

Major outages like the one we just had that affected a large percentage of users for 12 hours or more seem to happen about twice a year just based on the historical data.
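
To put the 2017 counts above in perspective, here is a rough back-of-the-envelope calculation in Python. It only uses the numbers from this post; since 12 is a minimum for unplanned outages and some planned ones needed several fixes, the result is a lower bound on frequency, not an official figure.

```python
planned = 7          # planned outages in 2017 (from the counts above)
unplanned_min = 12   # "at least 12" unplanned outages
major_per_year = 2   # major outages, roughly twice a year

total_min = planned + unplanned_min
avg_days_between = 365 / total_min
avg_months_between_major = 12 / major_per_year

print(f"at least {total_min} outages, about one every {avg_days_between:.1f} days")
print(f"a major outage about every {avg_months_between_major:.0f} months")
```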
