Firmware Update Causing New Fatal cosock Error

@nayelyz @robert.masen

One of my Edge drivers has been running with no problems for one of my users, up until he received the new firmware update to his hub. The driver is now experiencing a fatal error, seemingly originating from cosock.

The log below shows the error, which appears to happen following a sequence of received socket messages. Note that although the log messages refer to ‘HTTP’, the HTTP library is not being used. The data receive loop runs in a handler registered via thread:register_socket on a separate driver thread.

2022-06-07T21:00:59.885006207+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:00:59.912726207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:Status/Temperature/Below)
2022-06-07T21:00:59.953065874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:00:59.981014874+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.013081207+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.049645540+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.076419874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.106395207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.233038874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.435142207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:02:00.623653703+00:00 FATAL ONVIF Video Camera V1  RPC Status: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {"date": "Tue, 07 Jun 2022 21:01:59 GMT"} }
Traceback:
stack traceback:
        [C]: in function 'coroutine.yield'
        [string "coroutine"]:9: in function 'wrapped_coroutine_resume'
        [string "cosock.lua"]:222: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:876: in main chunk

Thanks for sharing this, Todd. Can you provide the driverID, please?
If we need more info, we’ll let you know.

Driver ID 1d68e5c3-03c0-4229-985d-14e225550049

As I stare at it some more, I see the error actually happened a full minute after the last received data, so I’m not sure whether it is related to the receive thread or not.

EDIT: No, I think it IS related to the receive thread. Looking back at other logs for this particular camera, there can be up to 40 messages received at this point, so it looks like things crapped out after just the first 5.

Ok, thanks for the extra information. Has this error been reported by a single user, or are you seeing it as well?

I have not reproduced it, so just one user reporting so far.

And I should say I’ve never seen this error before!

I just wanted to let you know I am looking into this.

Without getting into too much of the inner workings of our platform, a new hard time limit was recently introduced for some operations as part of our code sandbox. The limit was set to 60 seconds, and the consequence of exceeding it is that the driver is restarted. This limit was added after the bugs you helped us fix in the TLS socket implementation, in the hope that we could discover bugs like that more quickly. At the very least, it looks like having more information about which operation failed would be good.

I can say for sure that this limit would not apply to any receive calls, but potentially to either send or connect calls on a socket. It would also apply to things like Device:emit_event or Driver:try_create_device. The initial thinking is that 60 seconds would be more than enough time for any of those operations, so exceeding it would mean we have a larger problem we need to investigate.
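For what it’s worth, here is a minimal sketch of how an outbound call can be kept well under that limit with explicit socket timeouts. The host, port, payload, and the 10-second value are purely illustrative and not taken from this driver.

    local cosock = require "cosock"
    local socket = cosock.socket

    -- Hypothetical outbound helper; host, port, payload and the 10-second
    -- timeout are illustrative only.
    local function send_request(host, port, payload)
      local sock = socket.tcp()
      sock:settimeout(10)                 -- keep connect/send well under 60s
      local ok, err = sock:connect(host, port)
      if not ok then
        sock:close()
        return nil, "connect failed: " .. tostring(err)
      end
      local sent, send_err = sock:send(payload)
      sock:close()
      if not sent then
        return nil, "send failed: " .. tostring(send_err)
      end
      return true
    end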

Just out of curiosity, since it seems like this driver is working with video, are you sending large amounts of data across a socket?

Thanks for the update.

In this case I don’t think there are any socket send calls going on at the point of the error, so it must have been related to one of the incoming socket connection requests received from the device. There is no video streaming occurring at this point. This happened when the driver had just subscribed to motion events from the camera, and the camera sends back a burst of 40 or so ONVIF XML messages (requiring 40 connect/receive sequences) with various event states. This device is somewhat unique in that respect; most cameras only send back a handful of event state updates.

As most of these event messages don’t contain the motion event that the driver is looking for, they are simply ignored once received. Only when the target motion event is seen does any kind of emit_event occur.

Definitely no try_create_device is happening at this point.
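Roughly, the filtering looks like this (just a sketch to illustrate; the topic string and function name here are made up, not my actual code):

    local capabilities = require "st.capabilities"
    local log = require "log"

    -- Illustrative filter: only the target motion topic results in an
    -- emit_event; everything else is logged and dropped.
    local MOTION_TOPIC = "tns1:VideoSource/MotionAlarm"   -- assumed topic name

    local function handle_event(device, topic, active)
      if topic ~= MOTION_TOPIC then
        log.warn(string.format("Received message for %s ignored (topic=%s)", device.label, topic))
        return
      end
      if active then
        device:emit_event(capabilities.motionSensor.motion.active())
      else
        device:emit_event(capabilities.motionSensor.motion.inactive())
      end
    end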

“40 connect/receive sequences”

That seems like a lot of calls to both socket.tcp and skt:connect. Are they done in sequence, or with some parallelism? I will look over these two code paths to see if anything stands out to me. At the very least, I am going to add some work to the next release to provide more information on these timeouts.

This error occurred after only 5, but I know there would have been many more ‘in the queue’.

Maybe it is the way my handler is structured? Here’s the basic flow:

  • During driver init, a thread is created (Thread.Thread()) and a socket handler is registered via thread:register_socket.
  • Upon an incoming connect request, my handler accepts the connection and then issues a cosock.spawn() to receive and parse the XML data, and ultimately close the connection (roughly as sketched below).
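A rough sketch of that structure, just to illustrate the shape of it (the names are made up and the ONVIF XML parsing is elided):

    local cosock = require "cosock"
    local socket = cosock.socket
    local Thread = require "st.thread"

    -- Rough sketch of the structure described above; names are illustrative
    -- and the ONVIF XML parsing is elided.
    local function start_event_listener(driver)
      local server = socket.tcp()
      server:bind("*", 0)                 -- any free port
      server:listen(5)

      local thread = Thread.Thread(driver, "event listener")
      thread:register_socket(server, function()
        -- Fires when the server socket is readable, i.e. a camera is connecting
        local client = server:accept()
        if not client then return end
        cosock.spawn(function()
          client:settimeout(5)
          local line, err = client:receive()   -- e.g. "POST /event HTTP/1.1"
          -- ... receive and parse the rest of the ONVIF XML event here ...
          client:close()
        end, "event receiver")
      end)
    end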

I suspect that the issue is skt:accept if this is a server socket.
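If that turns out to be the case, one thing worth trying (purely a sketch, not a confirmed fix) is putting a short timeout on the listening socket so an accept can never block for long:

    local log = require "log"

    -- Illustration only (not a confirmed fix). With a short timeout set on the
    -- listening socket, accept() returns nil, "timeout" instead of blocking,
    -- so a stalled accept can't run anywhere near a 60-second limit. `server`
    -- is assumed to be the listening socket passed to register_socket.
    local function accept_client(server)
      server:settimeout(1)
      local client, err = server:accept()
      if not client then
        if err ~= "timeout" then
          log.warn("accept failed: " .. tostring(err))
        end
        return nil
      end
      return client
    end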


Yes, it is a server socket.