Device Health Beta 0.24.5

Centercode doesn’t seem to make it clear how big the problem is under the current firmware, so please post a list of devices that are reporting as unavailable, in the format below.

Device Name:
Product:
Time Unavailable:
Still Function:
Battery/Mains:
Notes:

My List

Device Name: Stevens Presence
Product: SmartThings Presence Sensor
Time Unavailable: 3 Days
Still Function: Yes
Battery/Mains: Battery
Notes:

Device Name: Stevens Car
Product: SmartThings Presence Sensor
Time Unavailable: 3 Days
Still Function: Yes
Battery/Mains: Battery
Notes:

Device Name: Multisensor Kitchen
Product: Aeon Multisensor 6 (Running 1.11eu)
Time Unavailable: 2 Days
Still Function: Yes
Battery/Mains: Mains
Notes:

Device Name: Multisensor Office
Product: Aeon Multisensor 6 (Running 1.11eu)
Time Unavailable: 2 Days
Still Function: Yes
Battery/Mains: Mains
Notes:

OK, two responses to this. 1) Why limit it to devices that are reporting as unavailable but still functioning? What about devices that report as unavailable and are not controllable? 2) Why not just post in the existing beta firmware thread for 24.5? Why do we need another thread?

Changed to all devices.

I feel this is a huge issue and needs its own thread to get the attention it deserves.

no issues with device health for me in the beta…

I’ll be here all night updating this thread…

I did add another Centercode bug, and updated another bug for Zigbee devices dropping again.

Beta 24.5 has not gone well in my opinion, and I agree this needs some serious attention right now.

Just wanted to acknowledge we’ve seen this post and are investigating a few device health related issues. One of them started late last night and is related to device control and reporting. This issue is mentioned on our public status page and is actively being worked on.

We’re also looking into the issue where devices are offline but controllable and reporting. At this time it doesn’t appear to be related to the beta firmware and our Device Health team is actively looking into it.

We’ve been looking into the other issues that have been reported and debugging them one by one. Thank you for the reports and assistance in tracking down issues.

Thanks for the update!

Any info is appreciated!

@tpmanley
Is it possible that these issues are affecting Z-Wave direct association?

I have 3 Linear aux in-wall dimmers acting as masters, which I use in 3 different places for 3-way lights.

The behavior is almost like the CPUs in the master and slave are at 100% and commands are delayed and queued up.

I can click several times, let’s say on and off, and there might be a delay; then I can see by the status LED on the switch that it slowly sends that many commands to the slave switch.

I used to be able to use the master switch to dim the slave lights. Now the response is so slow I can only turn the slave on and off, because holding the master to dim up or down is delayed and then sets the slave to either full on or most dim.

So again, on the master side, I can click on the paddle, but if I click more than once it seems to queue up the commands, which I can see going out later by the flash of the status LED.

Okay, I just remembered a setting that might cause something like this.

But how did all three get their settings corrupted?

I just remembered that these switches have association groups: the switch sends to association group 1, waits some amount of time, then sends to association group 2, and so on.

My Z-Wave direct association issue might be fixed. I used Z-Wave Tweaker.

Each of the 3 master switches had extra devices listed as slaves.

All the extra devices’ IDs just happened to match “ghost” devices that appeared this week that I had to request support to remove.

Each master only ever controlled one slave, so it is unclear why multiple devices were listed for each target association group in each switch.

Okay, I think there are 2 scenarios.

  1. Back in 2015 I accidentally added 3 devices instead of just 1 into each switch. For over 2 1/2 years it did not matter, until this week, when something suddenly changed and those extra devices started causing problems.

  2. Those extra devices somehow got added this week, when the problem started.

  3. ??

I’ve been noticing similar behavior with Schlage locks. It hadn’t really been an issue until last night. The lock was “unavailable” when the good night routine ran, so it didn’t lock. When the lock came back online it (correctly) reported unlocked, so the welcome home routine ran, turning on lights, disarming SHM, etc.

Hi @tpmanley,

Thanks for that update. I’m starting to see behavior in online/offline statuses unlike anything I’ve seen in the past.

Take a look:

Those Z-Wave switches never go that long for Last Activity. Typically it’s an hour, and anything past that is when I’d see the erroneous offline message.

For the first time ever I see my dimmer show up. You can also see my lock as well. The lock was manually used much less than 13 hours ago (5:25am EST to be specific).

My water valve has also been rock solid for months, and now it’s gone wonky.

Something must have changed very recently, and not for the better, unless Last Activity and offline/online status are going to be used or displayed differently than in the past.

EDIT: @tpmanley

So 5 hours later I’m trying to get the Foyer Lights working…

In the app (classic) I can turn it on, but not off because the state doesn’t update, BUT Alexa will turn it off - followed by the message “Foyer Lights isn’t responding. Please check network connection and power supply”. Once she turns it off, I can turn it on again via the app, but not off. This is easily repeatable.

I fixed this problem by using the air gap switch. Once I did that, it started working again:

All the other devices in the first pic above started working on their own, but now I have new offline devices down for a few hours I should track down…

EDIT: I just had to do the same process on another device, Steph’s Ceiling Fan:

@tpmanley

Sorry to interject here, but with a good number of posts and complaints and confusion in the past week or so, this seems like the best thread to ask about this:

What is SmartThings’ stance on Device Health support for devices using custom device handler code?

I’m not sure I understand what you mean by stance. Are you asking if custom device handlers can use device health?

Exactly. Sorry if the question wasn’t clear. :)

And if custom device handlers can use device health, where is documentation on how to properly implement it?

There isn’t user-facing documentation yet.

Previously I posted this:

Device Health utilizes a new capability Health Check which was added to most official device handlers. For example:
SmartThingsPublic/devicetypes/smartthings/smartsense-open-closed-sensor.src/smartsense-open-closed-sensor.groovy

This capability uses the state checkInterval to track the device’s health. In the device handler above, the device is checked every 12 minutes (60 * 12) which displays as 720 in the IDE. This device is polled every 5 minutes so 12 minutes allows for two missed checks and a small buffer.
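The arithmetic in that quote can be sketched in a few lines. This is only an illustrative Python model, not SmartThings code: the function names are made up, and the 2-minute buffer is an assumption chosen so that a 5-minute poll with two missed checks comes out to the 720 seconds mentioned above.

```python
def check_interval_seconds(poll_minutes, missed_checks_allowed, buffer_minutes):
    """Seconds of silence tolerated before a device is considered unhealthy."""
    return (poll_minutes * missed_checks_allowed + buffer_minutes) * 60


def is_offline(last_activity_ts, now_ts, check_interval):
    """True when the device has been silent for longer than check_interval seconds."""
    return (now_ts - last_activity_ts) > check_interval


# 5-minute poll, two missed checks allowed, 2-minute buffer (assumed split)
interval = check_interval_seconds(poll_minutes=5, missed_checks_allowed=2, buffer_minutes=2)
print(interval)                       # the value the IDE would show for this example
print(is_offline(0, 600, interval))   # 10 minutes of silence
print(is_offline(0, 721, interval))   # just past 12 minutes of silence
```

With these numbers a device that misses two consecutive polls is still treated as healthy; it only counts as offline once the full 12 minutes have passed without activity.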

and Jim added this:

In case you look at other Device Handlers, you might see that certain devices that support the “Health Check” capability (which defines the checkInterval attribute) don’t actually send a checkInterval event. There is currently inconsistency in how the device status is reported across devices, with some devices using checkInterval and others using a different event.

The plan is to consolidate and make the usage consistent, but in the meantime just don’t get too confused if you see different devices doing it differently ;)

I’ve followed the above, plus some reverse engineering from the public GitHub, and added device health to some of my more important custom handlers. Works well.

Okay, so does that mean Health Check is not yet supported for use in custom device handlers?

Without documentation, looking at official device handler code, it appears there are two elements: the checkInterval event that sets up the time interval for device health checks, and the ping function that uses some kind of read command (e.g., readAttribute if a ZigBee device) for performing the actual check. Is that correct?
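To make the two elements concrete, here is a rough Python model of how they might fit together. It is a guess at the behavior, not the platform's implementation: `ModelDevice`, `health_status`, and the rule that a ping is only tried after the interval has elapsed are all assumptions made for illustration.

```python
class ModelDevice:
    """Hypothetical stand-in for a device handler; not a SmartThings API."""

    def __init__(self, check_interval, responds_to_read=True):
        self.check_interval = check_interval      # seconds, like the checkInterval value
        self.responds_to_read = responds_to_read  # whether a read command gets an answer
        self.last_activity = 0                    # timestamp of the last event, in seconds

    def ping(self):
        """Active check: stands in for a read command such as readAttribute."""
        return self.responds_to_read


def health_status(device, now):
    """Return 'online' or 'offline' for the device at time `now` (seconds)."""
    if now - device.last_activity <= device.check_interval:
        return "online"             # heard from recently enough, no ping needed
    if device.ping():
        device.last_activity = now  # assumed: a successful read counts as activity
        return "online"
    return "offline"


healthy = ModelDevice(check_interval=720)
silent = ModelDevice(check_interval=720, responds_to_read=False)
print(health_status(healthy, 800))
print(health_status(silent, 800))
```

In this model the interval only decides when to ask, and the ping decides the answer, which is why a device with no usable read command is the awkward case.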

Also, I’m wondering: If (for whatever reason) a device doesn’t support a read command as a polling method, then does that mean Health Check should not be used with it?

Well as Jimmy mentioned above, it is possible to incorporate it into custom device handlers. So I wouldn’t say it isn’t supported. It just isn’t documented so that it can be easily incorporated by community developers.

I spoke with a dev more familiar with device health and he said for Zigbee/Z-Wave DTHs, you should use the Health Check capability and a ping function. For cloud-to-cloud/LAN devices, there are some enrollment/status events that should be included in the initialize method. Similar to this:
