Firmware Update Causing new fatal cosock Error

@nayelyz @robert.masen

I have a user of one of my Edge drivers that has been running with no problems up until he received the new firmware update to his hub. The driver is now experiencing a fatal error, seemingly originating from cosock.

The log below shows the error, which appears to happen following a sequence of received socket messages. Note that although the log messages refer to ‘HTTP’, the HTTP library is not being used. The data receive loop occurs inside a separate thread:register_socket routine.

2022-06-07T21:00:59.885006207+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:00:59.912726207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:Status/Temperature/Below)
2022-06-07T21:00:59.953065874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:00:59.981014874+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.013081207+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.049645540+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.076419874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.106395207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:01:00.233038874+00:00 DEBUG ONVIF Video Camera V1  Received HTTP header:  POST /event HTTP/1.1
2022-06-07T21:01:00.435142207+00:00 WARN ONVIF Video Camera V1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-06-07T21:02:00.623653703+00:00 FATAL ONVIF Video Camera V1  RPC Status: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {"date": "Tue, 07 Jun 2022 21:01:59 GMT"} }
Traceback:
stack traceback:
        [C]: in function 'coroutine.yield'
        [string "coroutine"]:9: in function 'wrapped_coroutine_resume'
        [string "cosock.lua"]:222: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:876: in main chunk

Thanks for sharing this, Todd. Can you provide the driverID, please?
If we need more info, we’ll let you know.

Driver ID 1d68e5c3-03c0-4229-985d-14e225550049

As I stare at it some more, I see the error actually happened a full second after the last received data, so not sure it is related to the receive thread or not.

EDIT: No I think it IS related to the receive thread. Looking back at other logs for this particular camera, there can be up to 40 messages received at this time, so it looks like things crapped out after just the first 5.

Ok, thanks for the extra information. has this error been reported by a single user or you are getting it as well?

I have not reproduced it, so just one user reporting so far.

And I should say I’ve never seen this error before!

I just wanted to let you know I am looking into this.

Without getting into too much of the inner workings of our platform, a new hard time limit was introduced to some of the operations as part of our code sandbox. The limit was set to 60 seconds and the consequence of exceeding that number would be to have the driver restart. This limit was added after the bugs you helped us fix in the TLS socket implement, in hopes that we could discover bugs like that more quickly. It looks like at the very least having some more information about which operation failed would be good.

I can say for sure that this limit would not apply to any receive calls but potentially to either send or connect calls on a socket. It also would apply to things like Device:emit_event or Driver:try_create_device. The initial thinking is that 60 seconds would be more than enough time for any of those operations so exceeding that would mean we have a larger problem we need to consult.

Just out of curiosity, since it seems like this driver is working with video, are you sending large amounts of data across a socket?

Thanks for the update.

In this case I don’t think there are any socket send calls going on at the point of the error so it must have been one of the socket connect requests that was received from the device. There is no video streaming occurring at this point. This happened when the driver had just subscribed to motion events from the camera, and the camera sends back a burst of 40 or so ONVIF XML messages (requiring 40 connect/receive sequences) with various event states. This device is somewhat unique in that respect; most cameras only send back a handful of event state updates.

As most of these event messages don’t contain the motion event that the driver is looking for they are simply ignored once received. Only when the target motion event is seen is any kind of emit_event occurring.

Definitely no try_create_device is happening at this point.

40 connect/receive sequences

That seems like a lot of calls to both socket.tcp and skt:connect, is it done in a sequence or is it done with parallelism? I will look over these two code paths to see if anything stands out to me. At the very least I am going to add some work to the next release to provide more information on these timeouts.

This error occurred after only 5, but I know there would have been many more ‘in the queue’.

Maybe it is the way my handler is structured?:

  • during driver init, a thread is created (Thread.Thread()) and a socket handler registered via thread:register_socket
  • Upon incoming connect request, my handler accepts the connection, and then then issues a cosock.spawn() to receive & parse the XML data, and ultimately close the connection.

I suspect that the issue is skt:accept if this is a server socket.

1 Like

yes, it is a server socket

@robert.masen @nayelyz Any news on this?

I have a user (@amin300) that just reporting this same error again:

RPC Status: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {"date": "Mon, 11 Jul 2022 19:28:18 GMT"} }
Traceback:
stack traceback:
        [C]: in function 'coroutine.yield'
        [string "coroutine"]:9: in function 'wrapped_coroutine_resume'
        [string "cosock.lua"]:222: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:935: in main chunk

Sorry but we do not yet have enough information to know where this is coming from in the driver framework. The next release will update this error to provide significantly more information and no longer crash the driver, instead returning nil, err when it occurs.

Once we can pin down where the error is coming from we can make the required improvements, sorry for the extended delay

This error is repeatable in this user’s case, if that is any help. Would a hub log be of any use?

The driver logs could be helpful. The error in question happens with a method takes more than 60 seconds to complete but currently doesn’t provide what method took that long. If there was some logging in the driver itself tracing what was happening when the error occurred then maybe that could help. If this is something you could provide me instructions on how to reproduce, I could also dig a little further that way.

Is the user in question signed up for the beta release? The 0.44 beta will include additional information and should be releasing soon so in the worst case we can get some more info when that happens.

@amin300 - Can you turn motion events off and then on again for your camera with logging running and confirm that you see that same error happening? If so, please go to the IDE and into the hub screen, click on the View Utilities link, and then Send Hub Logs?

@robert.masen - I believe the error is happening during a socket receive call in the process_http_message function within this module. Here are the corresponding log entries:

2022-07-11T19:27:18.920627398+00:00 DEBUG ONVIF Video Camera V1.1  Received HTTP header:        POST /event HTTP/1.1
2022-07-11T19:27:18.965752398+00:00 WARN ONVIF Video Camera V1.1  Received message for IPCam #1 (configure!) ignored (topic=tns1:Device/tnsaxis:IO/VirtualInput)
2022-07-11T19:28:19.000778739+00:00 FATAL ONVIF Video Camera V1.1  RPC Status: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {"date": "Mon, 11 Jul 2022 19:28:18 GMT"} }
Traceback:
stack traceback:
        [C]: in function 'coroutine.yield'
        [string "coroutine"]:9: in function 'wrapped_coroutine_resume'
        [string "cosock.lua"]:222: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:935: in main chunk
2022-07-11T19:28:24.506238406+00:00 TRACE ONVIF Video Camera V1.1  Setup driver onvifDriver with lifecycle handlers:

So the error seems to hit during the receive loop starting on line 70.

I need to amend my previous post. The driver ‘subscribes’ to events with the camera, which all goes well, and then the camera starts sending event messages. In this particular case, the first event message is received just fine, but then the error seems to happen when the second event is sent by the device. So rather than happening on line 70 in the process_http_message function, it more likely could have happened back within the eventaccept_handler function, since the next http header was never logged.