Zwave Mesh Fragility

After experiencing multiple days of weird issues with my aeon multi-sensors, I discovered that my aeon home energy monitor had become unplugged by mistake. It appears that several devices were using it as a node and its disappearance triggered a wave of instability. Is there a way to get notified when a device goes offline or when the zwave mesh is having trouble communicating? This was a bizarre issue to troubleshoot and I only found it after doing several mesh repairs. On the third or fourth repair it showed an error with the home energy monitor. After getting the home energy monitor back on, things have been working a bit better.

The biggest puzzler for me is that there were many other zwave repeaters for those devices to connect to. Shouldn’t the orphaned devices have moved to another node during the first repair I did?

2 Likes

I don’t think zwave is this smart…

1 Like

Then what happens during a zwave repair?

1 Like

I always thought that one of the purposes of the “mesh” aspect of z-wave was to make it easier for things to recover if a single node goes down…

Edit: http://en.wikipedia.org/wiki/Z-Wave#Topology_and_routing

Short answer: No. This is the unfortunate side effect of mesh networks–they tend to keep trying multiple times before abandoning hope that the connection can be re-established.

See the SmartThings blog article on repeaters and what happens when a network is broken:

I don’t know exactly how SmartThings has implemented their zwave repair, but I can say that most of the time people don’t give the child devices enough time to complete their “I give up” sequence.

In particular–if you have any smartapps running that do interval polling during the time the heal is taking place, you run the risk of repopulating an old node table and not forcing the new routes to form.

If you want to heal a zwave network, the usual field engineer approach is: unplug the hub for 15 minutes. This should be enough to put all the child nodes into “help, I’m lost” status.

Restart the hub, and idle any processes that do polling.

Heal the network.

Wait another 15 minutes so that all the lost children can find new parents.

I notice ST’s official support page suggests waiting “30 seconds to 15 minutes.” I would use the 15 minutes as a minimum.

https://support.smartthings.com/hc/en-us/articles/200981864-How-do-I-make-sure-my-Z-Wave-devices-are-routing-optimally-

Don’t just go by the log entries saying that the heal is complete! That does mean the routing table has been rebuilt at the hub, but it doesn’t 100% guarantee that all the children have the new routes. Give it time for the new routes to propagate. Some field engineers will perform healing 3 or 4 times to make sure all alternate routes have been considered.

I think where people get confused is they assume “self healing” means “on the fly correction.” It doesn’t. The zwave protocol is what’s called a “source routed mesh network.” That’s why it requires a hub. The hub figures out the routing tables in advance with several alternate routes and saves those in a static routing table. That’s what gets used.

“As a source-routed static network, Z-Wave assumes that all devices in the network remain in their original detected position. Mobile devices, such as remote controls, are therefore excluded from routing.” It also assumes all the original devices are working as they were when included!

Typically when a device goes out of service, the hub just tries the second or third alternate route, one of which may work. It probably takes more time, and you may end up with traffic overload.
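To picture the "static table with alternates" idea, here is a toy model in Python. It is only a sketch: it is not Z-Wave's frame format or any controller's API, and the node names, the `ROUTES` table and the `send` helper are all made up for illustration.

```python
# Toy model of a source-routed static table: the hub stores a few
# precomputed routes per destination and tries them in order.
# Node names and routes are hypothetical.
ROUTES = {
    "multisensor": [
        ["hub", "energy_monitor", "multisensor"],              # preferred route
        ["hub", "hall_switch", "multisensor"],                 # first alternate
        ["hub", "hall_switch", "lamp_module", "multisensor"],  # second alternate
    ],
}


def send(destination, online, routes=ROUTES):
    """Return (route_used, attempts), or (None, attempts) if every stored route fails."""
    attempts = 0
    for route in routes.get(destination, []):
        attempts += 1
        # A stored route only works if every node on it is currently reachable.
        if all(node in online for node in route):
            return route, attempts
    return None, attempts


# The energy monitor has been unplugged, so the preferred route is broken.
online = {"hub", "hall_switch", "lamp_module", "multisensor"}
route, attempts = send("multisensor", online)
print(route, attempts)  # ['hub', 'hall_switch', 'multisensor'] 2
```

The message still gets through, but only after a wasted attempt on the stale preferred route, which is where the extra delay and traffic come from.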

Zwave is “self healing” because the router can be made to fix its own routing table to find the optimal first route without requiring a human to assign each route. That’s what the healing process does. But it takes time and if you’re running other processes at the same time, you can have problems.

The newest Zwave protocol, gen 5 (commonly called “Zwave Plus”) introduced Explorer Frames which are intended as a better way of pruning missing or damaged nodes. But it’s still not on the fly routing.

So:

Keep the hub offline for at least 15 minutes.
Idle any processes other than the heal during the heal process plus 15 minutes.
And if you really want optimal routing, repeat the heal 3 times. But I know that’s a long time to be offline.
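For a feel of how long that keeps you offline, here is a back-of-the-envelope sketch in Python. The 15 minute figures come from the steps above; the length of the heal itself (`heal_minutes`) depends on your network and is a placeholder, and I am assuming each repeated heal gets its own 15 minute settle period.

```python
def heal_downtime_minutes(heal_minutes, repeats=1, offline_minutes=15, settle_minutes=15):
    """Estimate total downtime: hub offline once, then each heal pass plus a settle period.

    heal_minutes is a guess you supply; it is not a value from any Z-Wave spec.
    """
    return offline_minutes + repeats * (heal_minutes + settle_minutes)


# Hypothetical example: a heal that takes 10 minutes.
print(heal_downtime_minutes(10))             # 40
print(heal_downtime_minutes(10, repeats=3))  # 90
```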

All of this is true for any zwave network, it’s not specific to SmartThings. But because it’s a source routed mesh network, you can’t just run the Heal and fix the network. You have to get all the devices up to date before you start processes other than routing going again.

BTW, many field engineers will take an overburdened parent off line during the entire heal process, then re-add it last WITHOUT doing a heal. That can break up a bottleneck. This does tend to be more of a problem in zigbee networks which prioritize parent selection to nodes with higher signal strength, but can affect zwave as well.

FWIW…

14 Likes

I should add…

Some network controllers are configured to shut down all other processes while a heal is going on. But most work the other way–they stop the heal if any other traffic comes through. This is why allowing other processes to run may cause you to end up with a new routing table but without having propagated it to all the end nodes.

Also, if you have any battery powered zwave devices, typically a doorlock or a smoke alarm, the router will usually wait until the child wakes itself up and then includes it in the heal. This can dramatically increase the time a heal takes to fully propagate. Personally, I’ve always suspected that the main thing repeating a heal does is pick up battery operated devices that were asleep the first time around, but I’ve had people argue with me over that. If somebody knows for sure, please share!

In any case, it’s just one more thing to be aware of. The more battery powered devices you have, the longer it will take to get an accurate new routing table and get it fully propagated.

The battery powered devices don’t act as repeaters, so that doesn’t affect neighbors much, but it does affect overall routing efficiency.

1 Like

It’s to make it easier for the person to get the repair done, because they don’t have to calculate all the routing tables themselves. It’s not necessarily self detecting on the fly repairs, though.

Both Zwave and Zigbee home automation are “source routed mesh networks,” which means the hub calculates all the best routes through the network, including some alternate routes, and stores these in a routing table.

To get new routes, you have to force the hub to redo all its calculations, which is harder than it sounds because it won’t consider a node truly “dead” until the hub has made 5 requests of the node each of which failed 3 times. (At least those are the usual settings.) This is to keep a node counted as “alive” if it’s just run into some temporary local interference, like what happens to WiFi when someone runs the microwave.
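As a rough illustration of that "5 requests, 3 tries each" rule, here is a small Python sketch. The class and method names are hypothetical, and I am assuming the failed requests have to be consecutive (any success resets the count), which the description above doesn't spell out.

```python
class NodeHealth:
    """Tracks consecutive failed requests for one node (illustrative only)."""

    def __init__(self, requests_to_dead=5, tries_per_request=3):
        self.requests_to_dead = requests_to_dead
        self.tries_per_request = tries_per_request
        self.failed_requests = 0

    def record_request(self, try_results):
        """try_results: booleans, one per transmission try (True = acknowledged)."""
        tries = list(try_results)[: self.tries_per_request]
        if any(tries):
            # Assumption: one successful try means the node is alive, so start over.
            self.failed_requests = 0
            return True
        self.failed_requests += 1
        return False

    @property
    def dead(self):
        return self.failed_requests >= self.requests_to_dead


node = NodeHealth()
for _ in range(4):
    node.record_request([False, False, False])
print(node.dead)  # False: only 4 failed requests so far
node.record_request([False, False, False])
print(node.dead)  # True: fifth failed request in a row
```

With these settings a node has to miss 15 transmissions before it is written off, which is why an unplugged repeater can linger in the routing table for a while.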

So a heal will force the hub to redo all the calculations, and then it will try its best to update the routing tables at all the nodes so everybody knows what to do. But it is possible for the routing at the hub to still include some dead nodes that it thinks are still alive. And it’s very possible for a particular end node not to get fully updated with the new routing, particularly if you let other zwave traffic run during the heal.

2 Likes

This is very interesting stuff!!

I read a post recently on a Homeseer forum written by a guy who had moved to Homeseer from Vera. His post was about how Homeseer has much better z-wave tools than Vera had, and that he was able to figure out why many things were flaky with Vera, all having to do with the Z-Wave mesh. One of the key things he learned was that devices needed to be included from a very close distance from the hub, like 2 feet or less. He said that if they were included from greater distances, while they would appear to be included correctly, they might have configuration flaws that would never be corrected by a network heal.

I really wonder about this. I know I have recurring sporadic delays that appear to be caused by my Z-Wave network, although I’m not certain of that. I’m thinking that perhaps I will tear down my entire z-wave network, and start over. I would rebuild the whole thing by physically taking the hub to each device.

Any thoughts?

1 Like

That would be true for door locks, which have a special security feature requiring the primary hub (not a secondary controller like the minimote) to be physically very close to the lock so that the security key can be exchanged at the time of the pairing. Usually within about a foot. If instead you include the door lock from a secondary controller, or from further away, the device will be added to the network as a zwave node but you won’t be able to control the lock. This cannot be corrected by a heal, because it’s not a routing issue. Instead you have to exclude the lock, usually do a reset (may vary by specific model) and then include it again at the proper distance.

This handshake utilizes zwave beaming (secure communication) but involves more than that. It’s not just the transmission of encrypted packets but the actual join itself.

OK, once you have the lock established as a lock you can move the hub back to its usual location.

Now you need to know that only devices that support beaming can act as repeaters to carry encrypted packets, even though they don’t decrypt them themselves. Most of the GE/Jasco non-battery devices do, as well as some other brands. You can confirm beaming support at the zwave alliance website by checking the product’s “conformance statement.”

http://products.z-wavealliance.org

Repeaters that support beaming do NOT have to be paired at the super close range, because they don’t do the extra security check. However, they may have to be paired directly with the hub as many secondary controllers don’t recognise beaming attributes.

The aeon minimote does support beaming; the gen 5 key fob from the same company does not. That said, I personally would probably pair any device I hoped to use as a beaming repeater directly with the ST hub, not via the Minimote. But I haven’t tried it.

But I don’t see any reason why other zwave devices would need to be paired at the super close distance, as far as I know it’s only the secure communication protocol that has that requirement.

(Which reminds me: when doing Internet research it’s important to remember that zwave has gone through 5 protocol generations and at least 3 chipset builds. An article on “best practices” dated 2008 or 2011 may well not apply in 2015.)

I suspect that the old advice to do 3 repeated heals in a row has to do with the 4 hop limit. We still have a 4 hop limit, but the routing table is built differently than it used to be, and healing is one of the areas where things have improved considerably with newer generations.
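To make the hop limit concrete, here is a minimal Python sketch that looks for the shortest route through a neighbor map and gives up if it would need more than 4 hops. The neighbor map and the `shortest_route` helper are hypothetical, and I am counting a hop as one link between two nodes, which is my own simplification.

```python
from collections import deque


def shortest_route(neighbors, source, target, max_hops=4):
    """Breadth-first search; returns the shortest node list, or None if none fits in max_hops links."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 == max_hops:
            continue  # this path has already used up its hops
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical house laid out as a single chain of devices.
chain = {
    "hub": ["a"],
    "a": ["hub", "b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(shortest_route(chain, "hub", "d"))  # ['hub', 'a', 'b', 'c', 'd']
print(shortest_route(chain, "hub", "e"))  # None
```

Device "e" is physically in range of "d" but still unreachable from the hub, which is the kind of layout where adding a repeater closer to the hub helps more than repeating the heal.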

FWIW…

1 Like

I believe issues like this could be resolved if we had some better network health monitoring tools. A network map that exposed the signal strength indicators would go a long way. This has been a long discussed issue going all the way through 2014. We need better diagnostic tools.

9 Likes

@JDRoberts, being that you seem very familiar with Z-Wave, I have to ask this question (that ST has hinted can be done, but doesn’t answer anytime someone asks HOW to do it.)

How can the aeon minimote, associated to ST as a secondary controller, be used to pair a device to the ST controller? The minimote docs claim it’s possible in the same section as saying that it’s not.

I know how to get the minimote into “add” mode (or exclude mode), but do I have to do something in the ST mobile app so that the ST hub is expecting a new device? (If so, what?) I’ve tried a few random things over this past weekend, but with the massive instability of ST during this time, I have no idea if they were even valid tests.

Thanks
Gary

1 Like

@garyd9, I have done this with great success, and it’s quite easy. In the inside compartment of the Minimote are four extra buttons: Include, Exclude, Associate and Learn. Learn is used to include/exclude the Minimote to/from ST. Once included, other Z-Wave devices can be added to ST, or excluded from ST, using the Minimote. If you press the Include button, the blue LED will start blinking (now in Include mode). Bring the Minimote close to the device you want to include and press its button. Quite quickly the Minimote will see the new device and it will be included to ST. It will show up in the mobile app as if you had included it with the app. When refreshed, Things will show the new device, which can then have its name set, etc. The Minimote will still be in Include mode, with the blue LED blinking, and you can move to the next device to be included.

Excluding devices with the Minimote is very similar. First, you press the Exclude button. The red LED will start blinking. Put the Minimote close to the device to exclude and press its button. The device will be excluded from ST. NOTE: Be sure to delete all of the device’s SmartApps prior to excluding the device, or you will end up with a partially deleted device, and a mess to clean up with Force Delete. As with inclusion, the red LED will continue to blink, and you can move to the next device to be excluded.

At one point I separated two distant bedrooms from my main ST Hub due to Z-Wave problems, and set up a second hub. I did all of the device exclusion using a Minimote paired to my main ST hub. Then I excluded the Minimote from the main ST hub, and included it to the new hub with the Learn button. Then I used the Minimote to include all of those dimmers to the new hub. This use of the Minimote made the whole process very easy and it goes much more quickly than using the mobile app.

5 Likes

The vast majority of my dimmers DO support beaming (Leviton DZMX1). However, as I described in the preceding post, a subset of these for two rooms were included to ST using a Minimote. They seem to function just the same as all of my other ones. The bulk of my devices were included to ST in situ, and for some of them it was tortuous to get them recognized by the hub (the hub stayed where it is). Your statement above makes me wonder whether having included them that way, their beaming functionality might be compromised.

So just include it “normally” with the minimote and it’ll automatically show up as a new (unconfigured) device in ST (once the mobile app refreshes)?? That’s too easy… WAY too easy…

I have to admit that, especially when ST is having a Bad Day, adding new devices is a royal PITA using the mobile app. Using the minimote (instead of unplugging and re-attaching the hub in different areas of the house) would really make that easier… I’ll be getting some more crack… (er… I mean switches) Wed and will try using the minimote…

I wonder if the process of excluding and re-pairing can be used as a mechanism to “repair” parts of the mesh network (or if it would do more harm when those devices being excluded are along the routes of other devices.)

The problem is we don’t have the tools to know which devices might need repair. There are tools that analyze z-wave networks, but none of them work with ST, as far as I can figure out.

I was able to include an Aeotec Z-Stick into ST as a secondary controller, plugged into a PC. I had hoped perhaps some of the diagnostic software might work with the Z-Stick (which sees the whole Z-Wave network), but so far I only have found a basic utility that works.

It’s painfully obvious that we need better z-wave diagnostic tools.

5 Likes

@garyd9, as bravenel said, it does work and it’s quite easy. I’ve also used it.

I just wanted to mention that if you have a gen 1 Minimote, the button labels are different but the functions are the same.

  • “-”= exclude
  • “+” = include
  • “Join” = Learn.

The blank button is used with Learn for a factory reset on the Minimote.

Also, “Bring the Minimote close to the device you want to include and press its button” means “Bring the Minimote close to the zwave device you want to include and put that device into pairing mode. Usually there’s just a button on the device you want to add that you will push.”

That one was probably obvious, but just wanted to clarify. The Minimote can only add zwave devices, which is probably why ST support doesn’t talk more about the ability.

2 Likes

Yeah, I have the 1st gen with updated firmware, and just mentally translate these days. I usually translate the blank button to be the “associate” button on the 2nd gen minimote. (That was the reason I bought the thing to begin with - and then ended up not needing the associate function as I’m using jasco 3-ways instead of linear/evolve 3-ways.)

@bravenel ,

I may just be overly cautious. I don’t know for sure whether there’s an issue with recognizing “beaming is supported” when the device was added by a secondary or not.

@garyd9 ,

You can’t heal a zwave network just by removing and adding a device again because the hub will only change the routing tables for immediate neighbors. Any node that is 2 or 3 hops away is unchanged. This is why adding a device gets you functionality, but the heal is needed for efficiency, if that makes sense.
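A very simplified way to see this in Python: treat the mesh as a neighbor map and split it into the nodes that get touched by a remove/re-add (the device and its direct neighbors) and everything else. The map, the device names and the `split_after_readd` helper are hypothetical, and a real controller's bookkeeping is more involved than this.

```python
def split_after_readd(neighbors, device):
    """Return (refreshed, stale): nodes updated by re-adding `device`, and all the rest."""
    refreshed = {device} | set(neighbors.get(device, ()))
    stale = set(neighbors) - refreshed
    return refreshed, stale


# Hypothetical chain of repeaters ending in a sensor.
mesh = {
    "switch_a": ["switch_b"],
    "switch_b": ["switch_a", "switch_c"],
    "switch_c": ["switch_b", "sensor"],
    "sensor": ["switch_c"],
}
refreshed, stale = split_after_readd(mesh, "switch_a")
print(sorted(refreshed))  # ['switch_a', 'switch_b']
print(sorted(stale))      # ['sensor', 'switch_c']
```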

A lot of commercial installations do a zwave heal every night as part of preventive maintenance. But going down for network maintenance in the middle of the night doesn’t match the use case for many residences, smoke alarms being one obvious example, but even lights and door lock logging as well.

edited to add:

Of course most commercial installations have a lot more nodes. Some residential installations never do a heal unless they’re seeing functionality problems, they really don’t care much about efficiency.

A heal can introduce problems as well, particularly if one device is near an intermittent source of interference and the heal causes that device to now be on an important route for a different device. That’s another reason why a middle-of-the-night network change probably doesn’t match a residential use case. I’ve mentioned the fact that my wheelchair can block signals.

The standard test problem in college network engineering classes that cover home automation is a new house where everything works perfectly, and then stops working after the owners move in. Most students know to look for baby monitors, big screen TVs (because of the metal, not the transmission), and recliners with metal frames. But hardly anyone (including me!) thinks about cast iron pans inside the kitchen cupboards! But it’s a classic intermittent issue, especially in a network with very few devices.

As @bravenel mentioned, better diagnostic tools would help a lot. Although zwave isn’t like wifi, there’s still some proprietary stuff with limited access. And then there’s the ST issue of multiple protocols in the same installation which can really complicate troubleshooting.

The only Z-Wave device that has given me trouble with SmartThings is one of my Schlage door locks. It’s the one that’s furthest from the hub. It completely stopped responding to commands at one point. I deleted it and re-added it and it has worked great ever since. I’ve added a GE/Jasco switch between the hub and door lock since the first time I paired the lock and I think that has helped.

Overall, I’ve had better luck with my Z-Wave stuff than I’ve had with Zigbee devices.

Another anecdote that has me puzzled. I see a lot of complaints about door locks being troublesome zwave devices and I have never had any trouble with mine. I agree that overall zwave probably is the most stable protocol the ST hub has at this point.

I had a home automation system from CenturyLink for about a week and one thing their zwave router (which was made by Netgear) did have was a map of the zwave mesh. It would show a failed node if you removed the batteries for about 10 minutes, and it showed the signal strength and mesh health for all of the devices.
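The "failed after about 10 minutes" behavior is easy to sketch if you have some record of when each device last reported in. This Python snippet is only an illustration of the idea: the `last_seen` data and the `failed_nodes` helper are made up, and it doesn't use any SmartThings or Z-Wave API.

```python
from datetime import datetime, timedelta


def failed_nodes(last_seen, now, timeout=timedelta(minutes=10)):
    """Return the sorted names of nodes not heard from within `timeout`."""
    return sorted(name for name, seen in last_seen.items() if now - seen > timeout)


# Hypothetical snapshot: the energy monitor has been silent for 45 minutes.
now = datetime(2015, 1, 1, 12, 0)
last_seen = {
    "energy_monitor": now - timedelta(minutes=45),
    "multisensor": now - timedelta(minutes=2),
}
print(failed_nodes(last_seen, now))  # ['energy_monitor']
```

Battery devices that sleep for long stretches would need a longer timeout than mains-powered ones, so a single 10 minute threshold is a simplification.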