[ST Edge] Driver Deep Sleep?

The only polling in either driver is done via device.thread:call_on_schedule. Both drivers integrate with a 3rd-party hub via LAN (Bond, Hue) and import connected devices that are manipulated via REST calls to the hubs. These scheduled calls are made on the hub devices themselves, not on the individual devices that are imported, and I wouldn't expect polling done on the hub device to influence calls made on another device. There are also occasional delayed calls made on the device thread, but those are transient and issued with device.thread:call_with_delay. Lastly, I am registering socket callbacks on the device thread using device.thread:register_socket, and that is also only done on the hub devices.

For more context, the Bond driver sends a network ping once a minute via UDP. That is what is being scheduled via call_on_schedule; any other "polling" actually takes place in the register_socket callbacks in response to the incoming packets.

The Hue driver is doing true polling (checking system state) every 5 minutes to stay in sync with changes made outside of ST. This will be removed once Server-Sent Events are supported by ST.

In both cases, the work being done inside of call_on_schedule is minimal. Does any of this sound like it would contribute to this slow spin up of the driver?

If either device.thread:call_with_delay or device.thread:register_socket is getting passed a function that will yield, it could be the culprit. One place that can sometimes be confusing is that device.thread:register_socket only indicates that a single receive can be called. With network sockets this can be a bit confusing, since a call to receive can ask for *a, *l, or a number of bytes, and there may not be enough data ready to fulfill any of those, which would result in a yield.

This clearly seems like another ergonomics issue in some parts of our driver framework; I will see if we can address those. For the particular case you are trying to avoid, I am curious whether the delay disappears if you wrap the body of your register_socket callback in cosock.spawn.

To keep things simple, I will stick with the design of the Bond driver:

  1. Scheduled UDP ping once per minute
  2. register_socket callback is called in response to the ping
    a. sock:receivefrom() called to receive data which has the device ID
    b. device online/offline state is updated
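
Put together, the design above might be sketched roughly like this. This is a hedged illustration, not the actual driver code: cosock/device.thread calls follow the Edge driver API, but bridge_ip, BOND_PORT, and handle_bpup_message are placeholder names.

local cosock = require "cosock"

local sock = cosock.socket.udp()

-- 1. Scheduled UDP ping once per minute (keep-alive to the bridge)
device.thread:call_on_schedule(60, function()
  sock:sendto("\n", bridge_ip, BOND_PORT)
end, "bpup ping")

-- 2. Callback fires whenever the socket is readable; both ping replies
--    and asynchronous device-event packets arrive here
device.thread:register_socket(sock, function()
  local data, ip, port = sock:receivefrom()
  if data then
    handle_bpup_message(driver, device, ip, port, data)
  end
end, "bpup callback")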

Asynchronously, we get additional packets from the device outside of the ping. These occur only when there are device events for the connected devices (e.g. on/off). Those also show up on this UDP socket and can contain state data for the device. In those cases, the device state in ST is emitted to match.

I will try to wrap them, but the scheduled work is very small. The packet resulting from the ping is tiny.

How would I go about using spawn for this?

device.thread:register_socket(sock, function()
  local data, ip, port = sock:receivefrom()
  handle_bpup_message(driver, device, ip, port, data)
end, string.format('bpup callback %s', device:pretty_print()))

I tried this, but I am getting an error:

device.thread:register_socket(sock, function()
  cosock.spawn(function()
    local data, ip, port = sock:receivefrom()
    handle_bpup_message(driver, device, ip, port, data)
  end, "cosock-receive")
end, string.format('bpup callback %s', device:pretty_print()))

The error is

2022-07-26T02:07:46.167389948+00:00 FATAL Bond Bridge LAN  Lua: runtime error: [string "cosock/socket/internals.lua"]:101: recvr waker already set, sockets can only block one thread per waker kind
stack traceback:
        [C]: in function 'assert'
        [string "cosock/socket/internals.lua"]:101: in method 'setwaker'
        [string "cosock.lua"]:236: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:139: in main chunk

@robert.masen I went ahead and just removed the scheduled calls from both drivers, temporarily eliminating my ability to ping the hubs. I’ll run with this for a couple days and report back.

In the meantime, can you clarify a couple of things:

  1. Does each device get its own thread? My drivers import a large number of devices and I had assumed that there was a different device thread for each.
  2. If I schedule something on the thread of Device A, would you expect it to impact Device B? If these calls are the issue, that would indicate that they do affect each other.
  3. Also, if it is these scheduled calls, why do you think it would only be slow after an extended downtime? Wouldn’t it always be slow?

How would I go about using spawn for this?

cosock.spawn(function()
  while true do
    local data, ip, port = sock:receivefrom()
    handle_bpup_message(driver, device, ip, port, data)
  end
end, string.format('bpup callback %s', device:pretty_print()))

That would achieve the same thing as what register_socket is doing.
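
For example, the receive loop could be started in the device's init lifecycle handler. This is a hedged sketch under that assumption; device_init and handle_bpup_message are illustrative names, not confirmed driver code.

local cosock = require "cosock"

local function device_init(driver, device)
  local sock = cosock.socket.udp()
  -- Inside a cosock coroutine, receivefrom() yields instead of blocking,
  -- so this endless loop cooperates with the other device threads.
  cosock.spawn(function()
    while true do
      local data, ip, port = sock:receivefrom()
      if data then
        handle_bpup_message(driver, device, ip, port, data)
      end
    end
  end, string.format("bpup receive loop %s", device.id))
end

The key difference from the failing example is that only this one coroutine ever blocks on the socket, so the "recvr waker already set" assertion is never hit.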

Does each device get its own thread?

Yes, each device gets its own Thread, which maps to a Lua coroutine, which is not the same thing as an OS thread. The limitation here is that only one coroutine can be executing at a time, and if a coroutine yields, it has to wait until it is selected to run again.
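
The cooperative model can be illustrated with plain Lua coroutines (this demo is mine, not driver code; cosock layers socket-readiness scheduling on top of exactly this mechanism):

local log = {}

local worker = coroutine.create(function()
  table.insert(log, "worker: start")
  coroutine.yield()             -- give up control, like a blocked receive
  table.insert(log, "worker: resumed")
end)

coroutine.resume(worker)        -- runs only until the yield
table.insert(log, "main: worker is parked")
coroutine.resume(worker)        -- worker finishes only when selected again

print(table.concat(log, " | "))
-- worker: start | main: worker is parked | worker: resumed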

If I schedule something on the thread of Device A, would you expect it to impact Device B?

The only way one thread could impact another thread is by being very busy without yielding. That is probably not what is happening in your driver.

Also, if it is these scheduled calls, why do you think it would only be slow after an extended downtime? Wouldn’t it always be slow?

This is a very good point. I had initially thought that we might be hitting a bit of a perfect storm, where the same thread was being used for a poll loop and we just happened to be hitting the tail end of a sleep, but raising this now makes me think you might be experiencing some extra delay because of the bug uncovered here.

Essentially, we are leaking some memory associated with each socket after it gets garbage collected, and we end up performing the cleanup only when a new device event comes through. We have a bugfix lined up for the next release, so hopefully this will solve your problem.

That sounds like a very promising lead. I’ll keep an eye on that fix and continue profiling with these thread modifications.

I have continued testing this without the scheduled calls and continue to see these delays, so I am leaning toward another cause such as the garbage collection issue you described. It also explains why I have seen it as bad as 30 seconds in the past.

@robert.masen As for the leak, I have thought about this more. I am only creating sockets in the device init for the hub devices. As an example, the Bond driver integrates with the Bond Bridge and imports lights, shades, fans, fireplaces, etc that are connected to that Bond Bridge. So I end up with 1 Bridge and multiple other devices that are managed by the ST driver. Only the Bridge device creates a socket, and it does that in the init for that device, so I don’t see how that leak would be impacting me either. The same socket is used throughout.

Any call to http.request will create and then destroy a socket, so that might account for the extra overhead.

Yes, of course. I neglected that part. Many http requests are made. Been staring at this code too long :slight_smile:

Thanks again

@robert.masen Ok, I think this is actually a perfect explanation for the downtime. Overnight, the driver is making http requests without user events, so we build a big backlog of sockets to be cleaned up when I wake up and hit the switch (i.e. cause an event) for the first time. I am going to set up a simple automation to cause regular events in the driver to verify the behavior.

@robert.masen I created a routine to test out this theory, and it seems to have panned out so far. I used a motion sensor that gets a lot of traffic and connected it to a dummy bulb so I could generate frequent events in the driver and force garbage collection. So far, the driver is much more responsive, and I haven't noticed the delay after downtime as long as I trigger the motion sensor first.

Since the motion sensor is always getting triggered, I don’t technically have much downtime, but this looks like it has worked for both drivers so far.

cc: @K_White You may want to give this a shot as well

Confirmed that the 44.09 hub firmware fixed this. Thanks again.