Information Only: ST not working 100%. Heal zigbee network


(Bob) #1

Today I had zigbee devices dropping off, and in general things had become erratic. Nothing I did helped the situation (re-paired devices, kicked them, told them to get their act together, etc.).
I have done the following which has worked. So far!!! :slight_smile:
Rebooted the hub. Things remained the same.
Checked my zigbee channel, which is 20.
Checked my 2.4GHz Wi-Fi channel, which is 1 and fixed so it never changes.
No conflict there.
Removed power from the hub (no batteries inserted).
Waited 30 minutes.
Re-connected power.
This will cause the zigbee devices to rebuild the mesh from scratch.
Everything came online OK and everything is working OK.
It’s been 3 hours now so fingers crossed.
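
The "no conflict" check between zigbee channel 20 and Wi-Fi channel 1 can be reproduced with a little arithmetic. This is only a rough sketch: the center-frequency formulas are the standard 2.4GHz channel plans, but the function names are made up for illustration, and the channel widths (22 MHz for Wi-Fi, 2 MHz for zigbee) are nominal figures assumed here, not measurements of any particular router or hub.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4GHz zigbee (IEEE 802.15.4) channel, 11 to 26."""
    if not 11 <= channel <= 26:
        raise ValueError("2.4GHz zigbee channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4GHz Wi-Fi channel, 1 to 13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def channels_overlap(zigbee_channel: int, wifi_channel: int,
                     wifi_width_mhz: float = 22.0,
                     zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two channels' nominal frequency bands overlap.

    The widths are assumed nominal values; real-world interference also
    depends on signal strength, distance, and traffic.
    """
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


if __name__ == "__main__":
    # The setup described above: zigbee channel 20, Wi-Fi channel 1.
    print(zigbee_center_mhz(20), wifi_center_mhz(1), channels_overlap(20, 1))
```

With these assumptions, zigbee channel 20 sits at 2450 MHz and Wi-Fi channel 1 is centered on 2412 MHz, so the two bands are well apart, which matches the conclusion above.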

Now I’m not saying this is the fix for you, just putting this out there as something that you might like to try if you are getting issues.

Disclaimer. Don’t blame me if this doesn’t work for you. It’s something I tried as a last resort and it did work for me. (At the moment).


#2

Yep. :sunglasses:

Taking the hub off-line for at least 15 minutes and then bringing it back, while leaving all of the other zigbee devices on power the whole time, will cause all of the individual zigbee devices to rebuild their neighbor tables. This is the equivalent of the Z-Wave “repair” utility. You may not see full improvements until the next day, so be patient.

Technically, that’s different from rebuilding the network, which would require removing devices and adding them back again. The method you describe does what zigbee calls a “heal”, which is intended to fix any broken message paths without having to address individual devices.

It’s one of those “can’t hurt, might help” kind of things. :electric_plug:


(Bob) #3

Thanks for the additional clarification.
It may just be a fluke but just thought I’d put it out there.


#4

It’s almost always worth a try when things have just gone flaky.

One thing the heal helps with is new local interference that changes which repeaters can reach which end devices: rebuilding the neighbor tables causes every device to select its best parent for the current circumstances. It’s also the fastest way to get your existing devices to use any new repeaters that you have added. :sunglasses:


(Ray) #5

You will probably know the answer: when the hub is offline, how do the devices know which repeater to go to and which mesh to build? Don’t they need to know the location of the hub first?


#6

They won’t start rebuilding until the hub comes back online. :sunglasses: Once the hub comes back online, all the devices come out of panic mode but they understand that something has happened so that’s when they rebuild their neighbor tables.


#7

I wish it were that easy and painless. After attempting a ZigBee heal, many of my devices just wouldn’t reconnect and needed a battery pull.


#8

I love the term panic mode. All I can see is my little Iris motion sensors completely bugging out chattering in a language we can’t hear.


(Ray) #9

I feel your pain there. I have 100+ ZigBee devices, and the mention of cutting hub power scared me. I have an APC backup power supply that will keep just the hub running for weeks.


(Kirk Hilzinger) #10

Got home after sunset when one routine ran that included Zigbee devices. At some point after that routine ran, I lost communication with all Zigbee devices. I tried your work-around and it seems to have restored them, though I am waiting for a few battery devices to finish showing up.


(PPO16) #11

I am thinking of regularly “sanitizing” my mesh. Ideally, just switching off the ZigBee/Z-Wave radios the hub uses would do the same as the power-off we all do when things go wrong.
Does anyone know of an API for that, which we could call about once a week?
Since the hub would keep running, we could wait 15 minutes and then turn the radios back on.

Just like flight mode… :wink:

Thanks
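
The "radio off, wait 15 minutes, radio on" idea could look roughly like the sketch below. Nothing in this thread identifies a real hub API for switching the radios, so `radio_off` and `radio_on` are hypothetical callables that you would have to supply yourself; the 15-minute default comes from the wait time mentioned earlier in the thread.

```python
import time
from typing import Callable


def forced_heal_cycle(radio_off: Callable[[], None],
                      radio_on: Callable[[], None],
                      offline_seconds: float = 15 * 60,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Switch the radio off, wait, then switch it back on.

    radio_off and radio_on are hypothetical placeholders: no such hub
    API is named in this thread. The sleep parameter exists only so the
    wait can be replaced with a fake in tests.
    """
    radio_off()
    try:
        sleep(offline_seconds)
    finally:
        # Always bring the radio back, even if the wait is interrupted.
        radio_on()


if __name__ == "__main__":
    # Dry run with print statements standing in for real radio control.
    forced_heal_cycle(lambda: print("radio off"),
                      lambda: print("radio on"),
                      offline_seconds=1)
```

A weekly schedule would then be a matter of calling this from whatever scheduler you already use.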


(Luigi Semenzato) #12

Is this “zigbee heal” process really a good suggestion? The zigbee network is supposed to be self-healing; for instance, see this Quora answer:

https://www.quora.com/Is-it-possible-to-make-a-self-healing-ZigBee-network

which seems to be from a knowledgeable source. Unfortunately, the answer does not report how quickly the network can reconfigure, but I would expect it to happen within minutes. Does anybody know?

Of course it’s also possible that the network is self-healing in theory, but there are bugs on specific devices that prevent that from happening.


#13

It’s one of those “can’t hurt, might help” kinds of suggestions.

While zigbee mesh is self healing, that doesn’t mean quite the same thing as the full zigbee heal, which is what happens when the coordinator (hub) is powered down.

“Self healing” in this context simply means that if a preferred message route is unavailable, the sending node will try an alternate route. There are multiple steps involved in all of that, but that’s really all it means. No human intervention is required in order to get a message to take a different path. And this process, as you guessed, is very quick, usually less than a minute.

The problem comes in educating the nodes as to what potential paths are available, and the depth of the network at each repeater. (Different model devices support different numbers of children, so if a parent that could support five children goes bad and the only alternative available supports just three children, you can be left with “orphans”.)

If you rely solely on zigbee self healing, it’s quite possible to end up with “orphan nodes” which know they belong to this network, but which have no parent to repeat messages for them, even though there are parents with available child slots. :disappointed_relieved:
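
Here is a toy model of the orphan problem, using the five-children and three-children numbers from above. It is a deliberately simplified sketch, not how a real zigbee stack works: the `Repeater` class, the `rehome` helper, and the device names are all invented for illustration, and a real end device picks a parent by link quality rather than "first one with a free slot".

```python
from dataclasses import dataclass, field


@dataclass
class Repeater:
    name: str
    capacity: int  # maximum number of children this model supports
    children: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.children) < self.capacity


def rehome(children, candidates):
    """Give each child the first candidate repeater with a free slot.

    Returns the children that found no parent (the orphans).
    """
    orphans = []
    for child in children:
        parent = next((r for r in candidates if r.has_room()), None)
        if parent is None:
            orphans.append(child)
        else:
            parent.children.append(child)
    return orphans


def demo():
    # Five sensors whose parent (a repeater with five child slots) has failed.
    sensors = ["s1", "s2", "s3", "s4", "s5"]
    small = Repeater("old-plug", capacity=3)
    late = Repeater("new-plug", capacity=5)  # joined after the sensors did

    # Self healing (as modeled here): the sensors only try the repeater
    # they already know about, so two of them end up without a parent.
    orphans_self_heal = rehome(sensors, [small])

    # Forced heal (as modeled here): everyone starts over and may
    # consider every repeater, including the late addition.
    small.children.clear()
    orphans_forced_heal = rehome(sensors, [small, late])
    return orphans_self_heal, orphans_forced_heal


if __name__ == "__main__":
    print(demo())
```

In this toy run the self-heal step leaves `s4` and `s5` orphaned, while the start-over step places all five sensors.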

Also, in some cases when potential parents were added to the network after the children already existed, those parents end up never being considered as possible substitutions. This just gets complicated.

When you temporarily remove the hub (the coordinator) from a zigbee network, the other devices assume that the network is being physically moved and that none of the previous paths can be relied upon. Consequently, when the coordinator comes back, everybody starts over in building their preferred paths. This will generally pick up any orphans and also improve efficiency if new parents are available who would be a better first choice than the old ones.

So in this sense “healing” (sometimes called “forced healing”) is a much more thorough process and allows every end device to consider every possible repeater as a parent candidate. This process can take much longer than a self heal, sometimes hours, to complete, so field techs usually don’t check for the full results until the next day.

Self-healing just occurs when a sender tries to get a message through, the message doesn’t get through, and the sender asks itself “hmmm, do I know of any other paths to that destination?” Self-healing does not normally assign new parents to orphan devices.

So self-healing is a good thing, and helps keep the network operating when there are temporary problems along an individual message route, such as local interference. But a full heal essentially rebuilds the whole network and causes more efficient parent/child assignment, as well as picking up any orphans.

I’m sorry, I don’t know of any really simple explanation for this, but here’s a good technical paper which describes the difference between what happens when the coordinator (hub) goes off-line versus what happens when an individual repeater like a light switch goes off-line.

The one thing to note is that in their example it’s a sensor net which is choosing a new coordinator. In the zigbee home automation profile, which is what smartthings uses, there’s only one potential coordinator, a hub, and it’s either there or it isn’t. But the distinctions between the types of failures are similar, particularly with regard to orphan nodes.

The following paper was published in a peer reviewed journal, the International Journal of Computer Science Engineering and Technology (IJCSET)

http://www.ijcset.net/docs/Volumes/volume3issue3/ijcset2013030309.pdf

This is the key distinction:

When a device is found orphaned {during a forced heal}, a realignment or a channel re-scan process will be invoked.

That will happen only when the coordinator (hub) is powered off for at least 15 minutes and then on again. It does not happen in self healing.


(Luigi Semenzato) #14

Thank you, this is very helpful.

It’s a pity that the Smartthings hub doesn’t have an interface for doing exactly this then. Having to remove the batteries isn’t a nice UI.