[ST Edge] Driver Deep Sleep?

I have been noticing a consistent behavior in each of the LAN drivers I am developing. If no http requests are made for an extended period of time (e.g. overnight), then the first request takes a very long time to be made (3-8 seconds). If I trigger additional events, they are all sub-second and stay that way while I keep hitting the driver.

I am guessing that the underlying socket layer is reusing tcp connections to keep it fast during quick calls, but the initial delay is confusing.

@nayelyz / @robert.masen Do you know if drivers/hub go into some sort of deep sleep? Is there anything we can do to encourage connection reuse for http requests? Does something like keep-alive apply to edge drivers?

3 Likes

All of my LAN drivers have been doing that for the last few days.

2 Likes

Is that coming from actions within the ST app that trigger http requests?

Scheduled HTTP requests seem to be fine.

1 Like

Yes, from actions and from button presses within the ST app. It’s very reproducible for me. Maybe the delay is coming from the cloud calling back to the edge driver. Not sure where it comes from, just that it goes away while the driver is actively used.

2 Likes

We do not put drivers into any kind of a deep sleep at this time. I would imagine it is being caused by the length of the timeout set on the socket and the lack of tcp keep-alive. One option you could try would be to reduce the socket timeout by calling settimeout on your cosock socket with a small number like 0.5 or 1. If you receive multiple timeouts in a row, treat that as a closed error and re-connect the socket. We are hoping to add tcp keep-alive soon; however, we don’t have an official timeline.
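
A minimal sketch of that pattern (the connect helper, the state table holding the socket plus device ip/port, and the 1-second / 3-timeout values are all illustrative, not part of the driver libraries):

local cosock = require "cosock"

local MAX_TIMEOUTS = 3

-- Hypothetical helper that (re)opens the tcp connection to the device with a
-- short timeout instead of the default.
local function connect(ip, port)
    local sock = assert(cosock.socket.tcp())
    sock:settimeout(1)
    assert(sock:connect(ip, port))
    return sock
end

-- Read one line, treating several timeouts in a row as a closed connection
-- and re-connecting before trying again.
local function receive_line(state)
    local timeouts = 0
    while true do
        local line, err = state.sock:receive("*l")
        if line then
            return line
        elseif err == "timeout" and timeouts + 1 < MAX_TIMEOUTS then
            timeouts = timeouts + 1
        elseif err == "timeout" or err == "closed" then
            -- Too many timeouts in a row (or a real close): drop and re-connect.
            state.sock:close()
            state.sock = connect(state.ip, state.port)
            timeouts = 0
        else
            return nil, err
        end
    end
end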

2 Likes

I have been trying things out to try and eliminate this delay. Do you think your luncheon library would do the trick since we can provide it the socket we want to use? Can we just create a shared request socket and use it for all of the luncheon requests so it doesn’t get closed? Not sure what the implications would be with multiple devices pushing that socket.

Do you think your luncheon library would do the trick since we can provide it the socket we want to use

Luncheon is primarily concerned with serializing and deserializing http messages, so if you are using http then it could work. The one note is that we still don’t have support for chunked encoding in Luncheon, so if your response bodies are chunk encoded you will need to provide your own source that would handle that encoding.
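
For anyone who hits that case, here is a rough sketch of what such a source could look like (the function name and the minimal error handling are illustrative only, not part of Luncheon):

-- Rough sketch of an ltn12-style source that decodes a chunked response body
-- from an already-connected (cosock) tcp socket; error handling is minimal.
local function chunked_source(sock)
    local done = false
    return function()
        if done then return nil end
        -- Each chunk starts with a hex size line, e.g. "1a4" or "1a4;ext=1".
        local size_line, err = sock:receive("*l")
        if not size_line then return nil, err end
        local size = tonumber(size_line:match("^%x+") or "", 16)
        if not size or size == 0 then
            -- Last chunk: drain any trailers up to the final blank line.
            done = true
            repeat
                local line = sock:receive("*l")
            until line == nil or line == ""
            return nil
        end
        local chunk, cerr = sock:receive(size)
        if not chunk then return nil, cerr end
        sock:receive("*l")  -- consume the CRLF that follows the chunk data
        return chunk
    end
end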

Not sure what the implications would be with multiple devices pushing that socket.

So long as the server is capable of handling that kind of behavior then you should be fine. This would require coordinating an http keep-alive with the server, but you should be able to make Luncheon send multiple Requests on the same socket.
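
Not a Luncheon-specific example, but as a rough illustration of the shared-socket idea, here is a hand-rolled sketch that pushes two plain HTTP/1.1 requests through one cosock tcp connection. It assumes the server honors Connection: keep-alive and answers with a Content-Length (no chunked encoding); the host, port, and paths are placeholders:

local cosock = require "cosock"

-- Placeholder device address; replace with the real host/port.
local HOST, PORT = "192.168.1.50", 80

local sock = assert(cosock.socket.tcp())
assert(sock:connect(HOST, PORT))

-- Hand-rolled HTTP/1.1 GET that asks the server to keep the connection open.
local function get(path)
    assert(sock:send(
        "GET " .. path .. " HTTP/1.1\r\n" ..
        "Host: " .. HOST .. "\r\n" ..
        "Connection: keep-alive\r\n\r\n"))
    -- Read the status line and headers, then a Content-Length body.
    local status = assert(sock:receive("*l"))
    local length = 0
    repeat
        local header = assert(sock:receive("*l"))
        local len = header:match("^[Cc]ontent%-[Ll]ength:%s*(%d+)")
        if len then length = tonumber(len) end
    until header == ""
    local body = length > 0 and assert(sock:receive(length)) or ""
    return status, body
end

-- Two requests over the same socket; no new tcp handshake for the second one.
print(get("/v2/devices"))
print(get("/v2/devices/office_fan/state"))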

@robert.masen I am now seeing ~30 second delays for the first REST call after a long downtime. I see this on all LAN drivers that use REST. After this call is made, further quick calls are executed as expected. This time noticeably increased with the latest firmware update.

Hey Blue,

a long downtime

Can you provide a general duration at which you start to see this happen? For example, does the request time jump after 5 minutes of inactivity, or after 1 hour?

Also, can you clarify what you mean by

LAN drivers that use REST

Is this with socket.http or is it with Luncheon? Are you creating a new tcp socket for each request (the default for socket.http) or are you re-using the same socket across multiple requests? Does an error occur between the long request and the subsequent short requests?

Sorry for the delayed response. I haven’t nailed down a firm time, but it seems like an hour or so of downtime results in a long delay on the next call.

I am seeing these delays for drivers that are using http = cosock.asyncify 'socket.http'. I have experimented with Luncheon but have not tried it for a real driver yet. As of right now, I am letting socket.http create and manage the socket.
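
For reference, that pattern looks roughly like the sketch below (the URL is a placeholder). With socket.http managing things, each request opens its own tcp connection and closes it once the response has been read:

local cosock = require "cosock"
local http = cosock.asyncify "socket.http"
local ltn12 = require "ltn12"

-- Placeholder device endpoint.
local body = {}
local ok, code = http.request {
    url = "http://192.168.1.50/v2/devices/office_fan/state",
    sink = ltn12.sink.table(body),
}
print(ok, code, table.concat(body))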

There are no errors that come across. There is simply a delay. I can see the event come into the driver, then there is a long delay, then the http request goes to the device.

This past weekend I was able to set up a test that would simulate what you are describing. Using my self-service driver, I set up a program that would send out a device event to trigger the call to http.request after an increasing delay.

I was able to see the request/response complete in less than 1 second after 1 hour, 2 hours, and 4 hours without any activity in the driver.

Could the issue be on the server side of things? Are you seeing the incoming request delayed on that side?

Is this an https url or just http? If it is an http url, you can use the create argument to http.request to track the send and receive calls, something like:

local cosock = require "cosock"
local http = cosock.asyncify "socket.http"

http.request {
    url = url,
    -- create is called by socket.http to build the tcp socket for this request;
    -- wrapping send/receive here logs when each of those calls actually happens.
    create = function()
        local tcp = cosock.socket.tcp()
        local _send = tcp.send
        local _recv = tcp.receive
        tcp.send = function(...)
            print("sending")
            return _send(...)
        end
        tcp.receive = function(...)
            print("receiving")
            return _recv(...)
        end
        return tcp
    end
}

I’m seeing it for both https and http. I will setup a similar test for the drivers and see what I find. Thanks.

To be more specific: I am seeing the delay measured from the time I press a button in the house (local automation) or from within the ST app. So it is possible that the delay is elsewhere in the chain.

I’m also seeing a delay in the automations or from within the ST app after a few hours. After the first slow http request to the device, subsequent http requests to the devices are sub-second.

I’m not sure whether it is due to deep sleep or not, because I have scheduled http requests to the devices (every 1-5 min) to pull states, and they are working fine.

I just want to reiterate this from above: we do not, at this point in time, put drivers into any kind of a “deep sleep”.

@robert.masen I did some deeper trace logging and it looks like it isn’t in the http requests themselves but something else in the driver framework:

2022-07-21T21:05:38.162615041+00:00 TRACE Bond Bridge LAN  Received event with handler capability
2022-07-21T21:05:42.684033375+00:00 INFO Bond Bridge LAN  <Device: XXXX (Office Fan)> received command: {"args":{},"capability":"switch","command":"on","component":"light","positional_args":{}}
2022-07-21T21:05:42.690687042+00:00 TRACE Bond Bridge LAN  Found CapabilityCommandDispatcher handler in bond -> fan
2022-07-21T21:05:42.696120709+00:00 DEBUG Bond Bridge LAN  TIME capabilities.switch.commands.on:<Device: XXXX (Office Fan)>
2022-07-21T21:05:42.701356042+00:00 DEBUG Bond Bridge LAN  TIME Light.on:<Device: XXXX (Office Fan)>
2022-07-21T21:05:42.711830042+00:00 DEBUG Bond Bridge LAN  TIME Light.on:get_subnet_dni:<Device: XXXX (Office Fan)>
2022-07-21T21:05:43.119969042+00:00 DEBUG Bond Bridge LAN  TIME Light.on:put_device_action:<Device: XXXX (Office Fan)>

You can see the initial 4.5 second delay between when the event is received and when it is passed to my handler. After that, things progress fairly quickly. The http request took 300-400 ms.

You can see that an immediate re-execution of that event results in only a fraction of a second between the first two log entries.

2022-07-21T21:21:14.449441926+00:00 TRACE Bond Bridge LAN  Received event with handler capability
2022-07-21T21:21:14.488935259+00:00 INFO Bond Bridge LAN  <Device: XXXX (Office Fan)> received command: {"args":{},"capability":"switch","command":"on","component":"light","positional_args":{}}

I can’t get visibility in between those two log entries created by the framework. That is occurring before the handler is executed.

This is what had me initially asking about a “deep sleep” as it seems like something is taking time to be spun up in the driver and then is cached for some amount of time thereafter.

Thanks for this extra info. I’ll see if I can’t figure out what might be happening.

1 Like

Not sure if it is relevant, but these events are being sent to sub-components of the driver. I haven’t tried it yet with the default component.

I doubt it would matter. Looking over the internals again, the only thing I can see that would cause this kind of delay is additional work that might be happening on the device thread. Each device is assigned its own coroutine when it is created, and that is where its work is performed; any of the command_handlers will be executing on this device thread. As a consequence, new work cannot be dispatched to a device thread until that thread is free, meaning any yields being performed on that device thread (especially sleep) will delay the handler from being executed.

To solve for this you have a couple of options. One would be to use the st.thread module to create a new thread for any long-running work you might want to perform; a slightly lighter-weight option would be using cosock.spawn, which is what st.thread uses under the hood.
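
As a rough sketch of the cosock.spawn route (poll_fn, the 30-second interval, and the thread name are placeholders), a polling loop spawned this way sleeps on its own coroutine rather than on the device thread:

local cosock = require "cosock"

-- Hypothetical long-running poll moved off the device thread.
local function start_polling(device, poll_fn)
    cosock.spawn(function()
        while true do
            poll_fn(device)           -- e.g. an http status request to the device
            cosock.socket.sleep(30)   -- yields only this coroutine; the device thread stays free
        end
    end, device.label .. " polling")
end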

Do you by chance have a polling loop somewhere in your driver that might be working on the device thread?