I’m one of many who experienced their v2 Hub going offline today and yesterday. The connection seemed to flap for about an hour before finally dropping for good. We were advised via email and the status page that, once whatever issue was happening with the ST cloud was resolved, we should reboot the hub by pressing the button on the back or by removing the batteries and power cycling it.
Quite a few folks here use a hub to monitor or control a vacation home that may be quite some distance from them, so this isn’t really a practical solution all the time. I’m concerned as well. I have routines set up to turn lighting on and off when we’re away from home, and control various home systems. What if I had not been home at the time?
I’m an IT Consultant by trade - I do infrastructure for large enterprises. Part of my job when designing or reviewing systems is looking for redundancy and automation, and making improvements. With so many places running lights-out infrastructure, in some cases from across the world, we always need a way to perform functions remotely as if we were in the same room.
To that end, here are my suggestions on how to handle these situations. Obviously it’d be nice if they didn’t occur. But I’m in IT. I know that stuff happens, and that it’s not always expected. So the best course of action is to have fallback routines.
ST should consider taking a page from the vendors in the enterprise wireless space, like Cisco, Aerohive, Aruba, etc. Many times a company’s wireless access points are mounted in locations that aren’t easily reached - like a warehouse ceiling 20-30ft up. It’s not practical to bring out a bucket truck and have to reset each one by hand when they can’t talk to their controller or master. Two things happen when this connection is lost:
- If the access point loses connection with the mother ship, it attempts reconnection. If it cannot reconnect after a certain period of time, say 30-60 minutes, it does a cold reset on itself, as if its power was cycled. It repeats this until it can reconnect.
- Within a few minutes of losing connection, the access point begins broadcasting a new wireless network, usually using its MAC address or name as the SSID. A network admin with a laptop can then connect to that SSID and use SSH to establish a terminal session to the access point. The password was previously set as part of the initial config, and using that they can log in to the access point and adjust its configuration or reload (reboot) it. Once it talks to its controller it automatically stops broadcasting the “emergency” SSID and resumes normal operation. That keeps you from having to get on a lift to touch the thing physically just to plug in a console cable or hit the reset button.
Sorry for the long explanation thus far. So how would this translate to an ST hub going offline?
- If the hub is unable to contact the ST cloud, after one hour it automatically reboots. If it still cannot connect after rebooting, it reboots again in another hour. You can put all kinds of checks here if needed, like only rebooting if it has an IP address, if DNS resolution is available, etc. Simple is usually better though.
- While the hub is unable to connect, either spin up a very basic web server or even an SSH server and allow local connections. You’d probably want some type of password protection or other kind of access control that would be set as part of the initial configuration. The web page could be totally basic - maybe just a big button that says “REBOOT HUB”. Click it and the hub power cycles. Once connected to the cloud, shut down the web or SSH server.
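To make the first suggestion concrete, here is a minimal sketch in Python of the "reboot after an hour offline" decision. Nothing here reflects how the hub firmware actually works: the class name `CloudRebootPolicy`, its parameters, and the optional `has_ip` check are all made up for illustration. The caller is assumed to supply the current time and the result of whatever connectivity check the hub already performs.

```python
from dataclasses import dataclass
from typing import Optional

REBOOT_AFTER_SECONDS = 60 * 60  # one hour offline, as suggested above


@dataclass
class CloudRebootPolicy:
    """Hypothetical policy: reboot once the cloud has been unreachable too long."""

    reboot_after: float = REBOOT_AFTER_SECONDS
    offline_since: Optional[float] = None  # when the current outage was first seen

    def should_reboot(self, now: float, cloud_reachable: bool, has_ip: bool = True) -> bool:
        if cloud_reachable:
            # Connection is fine: forget any outage in progress.
            self.offline_since = None
            return False
        if self.offline_since is None:
            # First failed check of this outage: start the clock.
            self.offline_since = now
            return False
        if not has_ip:
            # Optional extra check: a reboot will not fix a dead local network.
            return False
        return now - self.offline_since >= self.reboot_after
```

Because the outage start time lives only in memory, it is lost on reboot, so a hub that comes back up and still cannot connect would wait another full hour before trying again, which matches the behavior described above.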
The only problem with the 2nd method is that it would only help people monitoring a vacation home or remote location if they have some other device on that network they can use as a connection point. I don’t think these options need to be mutually exclusive, however; both could be done.
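For the second suggestion, a bare-bones local reboot page could look roughly like the following, using only Python's standard `http.server` module. This is a sketch under assumptions, not anything the hub ships: `make_handler`, `is_authorized`, the `/reboot` path, and the `reboot` callback are hypothetical names, and HTTP Basic auth over plain HTTP sends the password unencrypted, so a real version would want something stronger.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<form method='post' action='/reboot'><button>REBOOT HUB</button></form>"


def is_authorized(auth_header, password):
    """Check an HTTP Basic Authorization header against the configured password."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    _, _, supplied = decoded.partition(":")  # user name is ignored in this sketch
    return hmac.compare_digest(supplied.encode(), password.encode())


def make_handler(password, reboot):
    """Build a request handler bound to a password and a reboot callback."""

    class RebootHandler(BaseHTTPRequestHandler):
        def _reply(self, status, body=b"", headers=()):
            self.send_response(status)
            for key, value in headers:
                self.send_header(key, value)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def _check_auth(self):
            if is_authorized(self.headers.get("Authorization"), password):
                return True
            self._reply(401, headers=[("WWW-Authenticate", 'Basic realm="hub"')])
            return False

        def do_GET(self):
            # Any authorized GET just shows the one-button page.
            if self._check_auth():
                self._reply(200, PAGE, [("Content-Type", "text/html")])

        def do_POST(self):
            if not self._check_auth():
                return
            if self.path == "/reboot":
                self._reply(200, b"Rebooting...")
                reboot()  # hypothetical hook that power cycles the hub
            else:
                self._reply(404)

    return RebootHandler

# Example wiring (only while the cloud is unreachable):
# HTTPServer(("0.0.0.0", 8080), make_handler("secret", my_reboot)).serve_forever()
```

The server would be started when the cloud connection drops and shut down again once the hub reconnects.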
I’m making the assumption here that the problem that’s cropped up two days in a row is not the hub locking up and becoming unresponsive. That’s a totally different problem space, one I’d solve by implementing a watchdog process on the hub itself that runs in a protected execution ring and watches the other pieces of the hub software for some type of keep-alive. No keep-alive, reboot.
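The keep-alive bookkeeping behind such a watchdog is simple enough to sketch. The Python below covers only the "who has gone quiet" logic; it says nothing about running in a protected ring, which would depend on the hub's OS and hardware. `KeepAliveWatchdog` and the component names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List


class KeepAliveWatchdog:
    """Hypothetical watchdog: tracks the last keep-alive from each component."""

    def __init__(self, components: Iterable[str], timeout: float, now: float) -> None:
        self.timeout = timeout
        # Every component is treated as having checked in at start-up.
        self.last_seen: Dict[str, float] = {name: now for name in components}

    def keep_alive(self, name: str, now: float) -> None:
        """Record a keep-alive from a monitored component."""
        if name not in self.last_seen:
            raise KeyError(f"unknown component: {name}")
        self.last_seen[name] = now

    def stalled(self, now: float) -> List[str]:
        """Return the components silent for longer than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def should_reboot(self, now: float) -> bool:
        """No keep-alive from any one component means reboot."""
        return bool(self.stalled(now))
```

A watchdog created with `KeepAliveWatchdog(["radio", "automation"], timeout=30, now=0)` would have each component call `keep_alive` periodically, while a separate loop polls `should_reboot` and triggers the reboot when it returns `True`.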
Just my thoughts on ways to avoid having to physically access the hub when it needs a reboot. My feeling is that this level of technology shouldn’t require a separately purchased “reboot timer” or any similar level of kludge.