[SmartThings Edge] Issue with HTTP requests in LAN drivers

@nayelyz This latest firmware update (000.043.00003) seems to have broken some of my LAN drivers. The ones that have stopped working exhibit the same problem: when attempting an HTTP request, they get immediate socket timeouts. I have tried rebooting my hub, but that hasn’t helped. I’ve also tried deleting and recreating devices, and reloading the driver. Nothing helps. These drivers have been working fine up until now, so there is something in this new firmware that is causing this.

Here is typical code snippet:

local http = require "socket.http"
http.TIMEOUT = 5

sendurl = 'http://192.168.1.129:8000/onvif/device_service'   << example

ret, code, headers, status = http.request {
      method = 'POST',
      url = sendurl,
      headers = sendheaders,
      source = ltn12.source.string(sendbody),
      sink = ltn12.sink.table(responsechunks)
    }

These requests are going to valid and working IP addresses, yet they are immediately returning this error in the code return value, despite the configured 5 second timeout:

[string “socket”]:1535: timeout

As I mentioned above, I am seeing this behavior now in more than one driver (Shelly, Web Requestor, ONVIF, for examples) using HTTP, yet strangely I have others that seem to be still working OK (e.g. Roku).

I hope you can get the engineering team’s attention on this quickly. This is going to effect a lot of people.

1 Like

Ouch. Presumably you are on an external alpha programme as it hasn’t reached external beta yet. Don’t you get a feedback channel?

Hi, @TAustin
I’m already checking this with the internal team. I’ll let you know in case I need more info.

1 Like

It is highly discouraged to use any of the lua socket apis directly as they will always block your driver’s progress entirely while waiting on IO. In the above example, updating to use the cosock equivalent would be the ideal

local cosock = require "cosock"
local http = cosock.asyncify "socket.http"
-- ...
1 Like

Thank you, this solved the issue. For some reason, I didn’t think we needed to use the cosock.asyncify function anymore - that it was done automatically, but obviously I was mistaken.

It seems evident that some logic has been added to the latest firmware to literarily prevent drivers from directly using the HTTP library now - presumably to help protect us from ourselves!

1 Like

Dang. This is probably going to hit a bunch of drivers. I think the ST samples also use http directly.

Yes, and the documentation up to a few months ago did not have the clarification it does now about using asyncify with the http library (including in the RESTful example given), so I suspect you’re right.

1 Like

Yes. Looking at this, I wonder if this may have been contributing to all of the socket closed errors being reported on http calls as well.

1 Like

Maybe. I suspect this update will contain fixes to some of the reported socket problems so between using cosock.asyncify and the new firmware, fingers crossed those will be behind us now. We’ll have to see when it is officially available and they give details.

1 Like

The discussion of those socket closed errors and the circumstances in which they appeared reminded me of something I once read about a widely used socket library having a buffer handling bug (or something along those lines) that resulted in sockets being closed when they shouldn’t have been. It was so well understood that it was possible to identify with total confidence when that is what had happened and so recover. Unfortunately the socket library was so widely used and copied that the bug persists to this day and socket libraries still need to be written to account for it (and sometimes that gets overlooked). Something like that anyway.

Do you remember what programming language that library was for ?

Crazy how bugs like that can end up being so widespread and persistent!

I am currently recovering by checking the response code and reissuing the request. Works pretty reliably but feels a little hacky.

function HttpClient.isTruncated(code)
    return code ~= nil and type(code) == "string" and (code:match('socket.*closed') ~= nil or code:match('socket.*does not exist') ~= nil)
end
if HttpClient.isSuccess(request.code) then
    device:emit_event(...)
elseif HttpClient.isTruncated(request.code) then
    device.thread:call_with_delay(1, function()
        -- call it again
    end)
end

Just track the number of attempts to avoid an infinite loop

Berkeley-derived UNIX implementations often have a buffer overflow bug of this type when using a SIOCGIFCONF request. I don’t know if that’s the one @orangebucket was remembering. It was in the OS socket library, not at the language level. That one’s been around forever. Linux and Solaris do not have the problem.

2 Likes

It wouldn’t surprise me if it were C but I really have no detailed recollection and it is quite possible I fused stories together. I do remember seeing something about a Lua library having issues but that could be a) historical, and b) a completely unrelated library.

I can’t much about the exact bug either, but I think if simplified and trivialised to the extreme it could have been along the lines of trying to send line n + 1 when you have exactly n lines of data.

Unfortunately I am hopeless at remembering where I’ve seen things like this, largely because I wasn’t looking for them in the first place and got sidetracked.

1 Like

Story of my life right there :eyes:

2 Likes

I made the swap with mixed results

local http = cosock.asyncify "socket.http"

One driver worked fine with no change in behavior, while the other doesn’t work at all. In the case of the failing driver, there is a change in behavior for the status and code that are returned.

Before with require 'socket.http'

code:        200
status:      HTTP/1.1 200 OK

After with cosock.asyncify 'socket.http': (status is blank)

code:        [string "socket"]:1527: HTTP/1.1 HTTP/200 OK
status:

The working driver uses https, while the broken driver uses http.

https is not an option for the broken driver. It will refuse those connections. The sample code uses http, so I assume it works. Do you think the issue is the connection scheme or the device itself?

Oh boy, now you have me worried. One of my drivers uses both http and https. And having the change in the way code/status behaves is going to be a hassle to deal with. I better go back and regression test all the error recovery scenarios…

[string “socket”]:1527: HTTP/1.1 HTTP/200 OK

That doesn’t even make sense!! How could there be a socket error at the same time as an OK response?? Do you think it means the connection was closed?

Not sure what it means. I tried to parse the response data, but there wasn’t anything there. The code essentially holds the socket error plus what should be in the status field.

This sample has a comment I don’t quite understand:

if not body and code ~= "timeout" then
        local err = code -- in error case second param is error message

        error(string.format("error while setting switch status for %s: %s",
                            device.name,
                            err))
       elseif status ~= 200 then
         error(string.format("unexpected HTTP error response from %s: %s",
                             device.name,
                             status))
      elseif status == 200
        break
      end

In error case, the code is the error? It’s not the http result code?

One other thing I’ve noticed when using asyncify, is that it really creates a tossed salad in your log output if you have many devices because rather than things being executed fairly sequentially for each device (due to the blocking), things are now executed in a jumbled order, making it really hard to follow threads of device execution!

Yep. I have device:pretty_print() all over the place so I can pair up log statements for a given device. I’m probably going to implement a DeviceLogger to wrap that up and make it reusable.

1 Like