[ST Edge] Driver Deep Sleep?

The only polling in either driver is done via device.thread:call_on_schedule. Both drivers integrate with a 3rd-party hub via LAN (Bond, Hue) and import connected devices that are manipulated via REST calls to the hubs. These scheduled calls are made on the hub devices themselves, not on the individual devices that are imported, and I wouldn't expect polling done on the hub device to influence calls made on another device. There are also occasional delayed calls made on the device thread, but those are transient and issued with device.thread:call_with_delay. Lastly, I am registering socket callbacks on the device thread using device.thread:register_socket, and that is also only done on the hub devices.

For more context, the Bond driver sends a network ping once a minute via UDP. That is what is being scheduled via call_on_schedule; any other "polling" actually takes place in the register_socket callbacks in response to the incoming packets.

The Hue driver is doing true polling (checking system state) every 5 minutes to stay in sync with changes made outside of ST. This will be removed once Server-Sent Events are supported by ST.

In both cases, the work being done inside of call_on_schedule is minimal. Does any of this sound like it would contribute to this slow spin up of the driver?

If either device.thread:call_with_delay or device.thread:register_socket is getting passed a function that will yield, it could be the culprit. One place that can sometimes be confusing is that device.thread:register_socket only indicates that a single receive can be called. With network sockets this can be a bit confusing, since a call to receive can ask for *a, *l, or a number of bytes, and there may not be enough data ready to fulfill any of those, which would result in a yield.

This clearly seems like another ergonomics issue in some parts of our driver framework; I will see if we can address those. For the particular case you are trying to avoid, I am curious whether the delay disappears if you wrap the body of your register_socket callback in cosock.spawn.

To keep things simple, I will stick with the design of the Bond driver:

  1. Scheduled UDP ping once per minute
  2. register_socket callback is called in response to the ping
    a. sock:receivefrom() called to receive data which has the device ID
    b. device online/offline state is updated
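
Put together, the design above might be sketched roughly like this. This is a hedged illustration, not the actual driver code: cosock/device.thread calls follow the Edge driver API, but bridge_ip, BOND_PORT, and handle_bpup_message are placeholder names.

local cosock = require "cosock"

local sock = cosock.socket.udp()

-- 1. Scheduled UDP ping once per minute (keep-alive to the bridge)
device.thread:call_on_schedule(60, function()
  sock:sendto("\n", bridge_ip, BOND_PORT)
end, "bpup ping")

-- 2. Callback fires whenever the socket is readable; both ping replies
--    and asynchronous device-event packets arrive here
device.thread:register_socket(sock, function()
  local data, ip, port = sock:receivefrom()
  if data then
    handle_bpup_message(driver, device, ip, port, data)
  end
end, "bpup callback")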

Asynchronously, we get additional packets from the device outside of the ping. These occur only when there are device events for the connected devices (e.g. on/off). Those also show up on this UDP socket and can contain state data for the device. In those cases, the device state in ST is emitted to match.

I will try to wrap them, but the scheduled work is very small. The packet resulting from the ping is tiny.

How would I go about using spawn for this?

device.thread:register_socket(sock, function()
  local data, ip, port = sock:receivefrom()
  handle_bpup_message(driver, device, ip, port, data)
end, string.format('bpup callback %s', device:pretty_print()))

I tried this, but I am getting an error:

device.thread:register_socket(sock, function()
  cosock.spawn(function()
    local data, ip, port = sock:receivefrom()
    handle_bpup_message(driver, device, ip, port, data)
  end, "cosock-receive")
end, string.format('bpup callback %s', device:pretty_print()))

The error is

2022-07-26T02:07:46.167389948+00:00 FATAL Bond Bridge LAN  Lua: runtime error: [string "cosock/socket/internals.lua"]:101: recvr waker already set, sockets can only block one thread per waker kind
stack traceback:
        [C]: in function 'assert'
        [string "cosock/socket/internals.lua"]:101: in method 'setwaker'
        [string "cosock.lua"]:236: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:139: in main chunk

@robert.masen I went ahead and just removed the scheduled calls from both drivers, temporarily eliminating my ability to ping the hubs. I’ll run with this for a couple days and report back.

In the meantime, can you clarify a couple of things:

  1. Does each device get its own thread? My drivers import a large number of devices and I had assumed that there was a different device thread for each.
  2. If I schedule something on the thread of Device A, would you expect it to impact Device B? If these calls are the issue, that would indicate that they do affect each other.
  3. Also, if it is these scheduled calls, why do you think it would only be slow after an extended downtime? Wouldn’t it always be slow?

How would I go about using spawn for this?

cosock.spawn(function()
  while true do
    local data, ip, port = sock:receivefrom()
    handle_bpup_message(driver, device, ip, port, data)
  end
end, string.format('bpup callback %s', device:pretty_print()))

That would achieve the same thing as what register_socket is doing.
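
For example, the receive loop could be started in the device's init lifecycle handler. This is a hedged sketch under that assumption; device_init and handle_bpup_message are illustrative names, not confirmed driver code.

local cosock = require "cosock"

local function device_init(driver, device)
  local sock = cosock.socket.udp()
  -- Inside a cosock coroutine, receivefrom() yields instead of blocking,
  -- so this endless loop cooperates with the other device threads.
  cosock.spawn(function()
    while true do
      local data, ip, port = sock:receivefrom()
      if data then
        handle_bpup_message(driver, device, ip, port, data)
      end
    end
  end, string.format("bpup receive loop %s", device.id))
end

The key difference from the failing example is that only this one coroutine ever blocks on the socket, so the "recvr waker already set" assertion is never hit.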

Does each device get its own thread?

Yes, each device gets its own Thread, which maps to a Lua coroutine, which is not the same thing as an OS thread. The limitation here is that only one coroutine can be executing at a time, and if a coroutine yields, it has to wait until it is selected to run again.
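
The cooperative model can be illustrated with plain Lua coroutines (this demo is mine, not driver code; cosock layers socket-readiness scheduling on top of exactly this mechanism):

local log = {}

local worker = coroutine.create(function()
  table.insert(log, "worker: start")
  coroutine.yield()             -- give up control, like a blocked receive
  table.insert(log, "worker: resumed")
end)

coroutine.resume(worker)        -- runs only until the yield
table.insert(log, "main: worker is parked")
coroutine.resume(worker)        -- worker finishes only when selected again

print(table.concat(log, " | "))
-- worker: start | main: worker is parked | worker: resumed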

If I schedule something on the thread of Device A, would you expect it to impact Device B?

The only way one thread could impact another thread is by being very busy without yielding. That is probably not what is happening in your driver.

Also, if it is these scheduled calls, why do you think it would only be slow after an extended downtime? Wouldn’t it always be slow?

This is a very good point. I had initially thought that we might be hitting a bit of a perfect storm, where the same thread was being used for a poll loop and we just happened to be hitting the tail end of a sleep, but raising this now makes me think you might be experiencing some extra delay because of the bug uncovered here.

Essentially, we are leaking some memory associated with each socket after it gets garbage collected, and we end up performing the cleanup only when a new device event comes through. We have a bugfix lined up for the next release, so hopefully this will solve your problem.

That sounds like a very promising lead. I’ll keep an eye on that fix and continue profiling with these thread modifications.

I have continued testing this without the scheduled calls and continue to see these delays, so I am leaning toward another cause such as the garbage collection issue you described. It also explains why I have seen it as bad as 30 seconds in the past.

@robert.masen As for the leak, I have thought about this more. I am only creating sockets in the device init for the hub devices. As an example, the Bond driver integrates with the Bond Bridge and imports lights, shades, fans, fireplaces, etc that are connected to that Bond Bridge. So I end up with 1 Bridge and multiple other devices that are managed by the ST driver. Only the Bridge device creates a socket, and it does that in the init for that device, so I don’t see how that leak would be impacting me either. The same socket is used throughout.

Any call to http.request will create and then destroy a socket, so that might account for the extra overhead.

Yes, of course. I neglected that part. Many http requests are made. Been staring at this code too long :slight_smile:

Thanks again

@robert.masen Ok, I think this is actually a perfect explanation for the downtime. Overnight, the driver is making http requests without user events, so we build a big backlog of sockets to be cleaned up when I wake up and hit the switch (i.e. cause an event) for the first time. I am going to set up a simple automation to cause regular events in the driver to verify the behavior.

@robert.masen I created a routine to test out this theory, and it seems to have panned out so far. I used a motion sensor that gets a lot of traffic and connected it to a dummy bulb so I could generate frequent events in the driver and force garbage collection. So far, the driver is much more responsive, and I haven't noticed the delay after downtime as long as I trigger the motion sensor first.

Since the motion sensor is always getting triggered, I don’t technically have much downtime, but this looks like it has worked for both drivers so far.

cc: @K_White You may want to give this a shot as well

Confirmed that the 44.09 hub firmware fixed this. Thanks again.