[SmartThings Edge] Issue with HTTP requests in LAN drivers

This is what worked for me for the time being until I figure out more

local https = cosock.asyncify 'socket.http'
local http = require 'socket.http'

...

local request

if scheme == 'http' then
    request = http.request
elseif scheme == 'https' then
    request = https.request
end

local _, code, _, status = request {
    method = method,
    url = url,
    headers = headers,
    source = source,
    sink = ltn12.sink.table(raw)
}
1 Like

Let’s call the return values body and code. That code is handling the case where you don’t have a completed HTTP request, for example an unknown hostname or a timeout. In that case body is nil and code is the error message.

Update: Shouldn’t the sample code be checking whether code is 200, not status?

1 Like

I don’t have version 43 yet, but I am starting to see some reports of LAN drivers breaking on version 43. The https driver has been updated to use asyncify, the http driver was not due the issues described above. Has anyone else successfully made the move to 43? Sounds like @TAustin may have?

For the http vs https issues with asyncify, I have narrowed down the strange http behavior to GET/PATCH calls only. PUT calls seem to work fine. For the GET/PATCH with http, I see a code of [string "socket"]:1527: HTTP/1.1 HTTP/200 OK

No other differences that you are aware of, for example empty response bodies? Just pondering what sort of thing might break without the luxury of having the library code open in front of me.

No. I have tried removing everything else like empty request bodies, sinks, sources, etc. I am simply calling this now just to test

local _, code, _, status = request {
    url = url,
    headers = headers
}

url is just a plain http url with no queries or anything else
headers contains a single entry which is the auth token for the connected device.

I would note that this behavior is on hub version 42, not 43.

I have everything working OK for myself, but just got a report from someone who has been testing my ONVIF driver and it stopped working when they got the firmware update. Looking at those logs now – seems to be caused by socket timeouts occurring during HTTP requests. Will report more as I find out…

Thanks. Same boat. Hoping it isn’t device specific.

Looking through failed log with new firmware, besides the timeout I mentioned, got this nasty FATAL error which caused the driver to restart:

2022-05-16T20:31:24.539697011+00:00 FATAL ONVIF Video Camera V1  RPC Status: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {"date": "Mon, 16 May 2022 20:31:24 GMT"} }
Traceback:
stack traceback:
        [C]: in function 'coroutine.yield'
        [string "coroutine"]:9: in function 'wrapped_coroutine_resume'
        [string "cosock.lua"]:210: in field 'run'
        [string "st/driver.lua"]:737: in method 'run'
        [string "init.lua"]:836: in main chunk

This was just another HTTP request that had been executed a dozen times prior for this same device.

Is that a different problem to the one that started this thread?

I got a bit confused when you showed an example only using asyncify for https and not http as I inferred from the firmware release notes that there was a breaking change which would mean the http wouldn’t work any more when used like that.

The example was mine. I have been unable to get cosock.asyncify to work with http on v42. It does work for me with https on v42. The docs and release notes for v43 say that it is required for both http/https.

Oops, my error. I read that post several times and still flipped the conversation.

As @blueyetisoftware pointed out, the guidance is that asyncify needs to be used for both http and https. Any kind of blocking of the main driver thread is a big no-no. There may be one exception to this, and that is during the discovery lifecycle. I believe this executes on a separate coroutine, and thus you can do blocking socket calls. I’d like to be corrected by someone in-the-know if that is not the case.

After further review I think the socket timeout I reported today is due to a different issue and I retract that one.

The FATAL error, however is a whole different matter, and one I’ve never seen before. Not 100% certain it can be blamed on the new firmware. After exchanging messages with the user, it appeared to have happened during a scenario where the SmartThings app was doing it’s usual thing of throwing up Networking error messages and abnormally long pauses when interacting with it. So something was happening that created the condition for that FATAL error. Time will tell if it is reproduceable. Tagging @nayelyz for awareness of this error I reported here.

I’m pretty confident that these are still occurring even with asyncify. One of the alpha testers on my driver (@K_White) is seeing those timeouts even though he has the driver installed that utilizes asyncify. The timeouts happened immediately upon the hub updating. Every call is timing out. That is v43 with https

@TAustin For your “working” driver on v43, are you setting a socket timeout like the lan sample?

local body, code, headers, status = http.request({
        url = url,
        body = "on",
        -- this is unfortunately required to set a timeout locally
        create = function()
          local sock = socket.tcp()
          sock:settimeout(5)
          return sock
        end
      })

I have not been doing this for any drivers so far. I don’t see any other differences that account for the http failures on v42

I have what I guess would be called a global timeout, set like this:

http = cosock.asyncify “socket.http”
http.TIMEOUT=5

Seems to work OK.

The http/v42 combo isn’t working with asyncify for me due to a timing issue while receiving the status line. I copied over the lua libs so I could package them with my driver and add trace logging. I started with a straight copy and was able to duplicate the failures. Adding only this trace logging alleviated the issue. I can remove the logging and the error returns.

From http.lua

function metat.__index:receivestatusline()
    local status,ec = self.try(self.c:receive(5))

    -- ADDED THIS
    log.trace(string.format('self.c:receive(5) = %s, %s', status, ec))

    -- identify HTTP/0.9 responses, which do not contain a status line
    -- this is just a heuristic, but is what the RFC recommends
    if status ~= "HTTP/" then
        if ec == "timeout" then
            return 408
        end 
        return nil, status 
    end
    -- otherwise proceed reading a status line
    status = self.try(self.c:receive("*l", status))
    local code = socket.skip(2, string.find(status, "HTTP/%d*%.%d* (%d%d%d)"))

    -- ADDED THIS
    log.trace(string.format('self.c:receive("*l", status) = %s', status))

    return self.try(base.tonumber(code), status)
end

The version that works shows

self.c:receive("*l", status) = HTTP/1.1 200 OK

The version that doesn’t shows

self.c:receive("*l", status) = HTTP/1.1 HTTP/200 OK

Note the additional HTTP/ in the second version that is causing the issue.

@nayelyz I patched this issue with the following change to http.lua. I don’t view this as a permanent fix as this receive call should not be returning the extra HTTP/ prior to the 200. For the platform team, please let them know this behavior is specific to using asyncify and is present in at least firmware v42. I don’t have v43 yet to check there. Given the change in behavior by doing things like adding logging, this looks very much like a timing issue introduced by using asyncify. Thanks.

function metat.__index:receivestatusline()
    local status,ec = self.try(self.c:receive(5))
    -- identify HTTP/0.9 responses, which do not contain a status line
    -- this is just a heuristic, but is what the RFC recommends
    if status ~= "HTTP/" then
        if ec == "timeout" then
            return 408
        end 
        return nil, status 
    end
    -- otherwise proceed reading a status line
    status = self.try(self.c:receive("*l", status))
    local code = socket.skip(2, string.find(status, "HTTP/%d*%.%d* (%d%d%d)"))
    
    -- START PATCH TO GRAB CODE (Look for extra HTTP/ prior to the code)
    if code == nil then
        code = socket.skip(2, string.find(status, "HTTP/%d*%.%d* HTTP/(%d%d%d)"))
    end
    -- END PATCH TO GRAB CODE

    return self.try(base.tonumber(code), status)
end

Edit:

As I run this code, I am also seeing lines like this that have even more spurious HTTP/'s in the status

self.c:receive("*l", status) = HTTP/1.1 HTTP/200 OKHTTP/
self.c:receive("*l", status) = HTTP/1.1 HTTP/200HTTP/ OK

The patch is still working given that the regex fits, but the status is corrupted. I updated the patched section to this since the HTTP moves around

status = self.try(self.c:receive("*l", status))

-- START PATCH
local scrubbed, count = status:gsub('HTTP/','')
if count > 1 then
    log.warn(string.format('self.c:receive("*l", status) = %s', status))
    log.warn(string.format('scrubbed = %s', scrubbed))
end
local code = socket.skip(2, string.find(scrubbed, "%d*%.%d* (%d%d%d)"))
status = string.format('HTTP/%s', scrubbed)
-- END PATCH

return self.try(base.tonumber(code), status)

I would be curious to know if changing
status = self.try(self.c:receive("*l", status))
to
status = self.try(self.c:receive("*l"))
produces a sane status. If so there might be a safer place to prepend the HTTP/.

I am baffled that you are hitting a problem though. It just doesn’t seem like something that would go unremarked upon.

Yes. That also works. It looks like passing the status prefix to that line is the culprit. That prefix is being appended to each element of the buffer (sometimes) rather than just the beginning of the buffer.

That makes the final patched version a single line change from this

status = self.try(self.c:receive("*l", status))

to this

status = string.format('HTTP/%s', self.try(self.c:receive("*l")))

That means the bug is most likely in tcp.receive where it manipulates the prefix and parts of the message.