Platform timeouts a lot now (7 September 2018)


(Eric) #1

I’m seeing a lot of red in the IDE for execution timeouts…seems like the platform / infrastructure is having problems…

Others seeing this?


[RELEASE] NST Manager v5.0
#2

Unsure if it’s related, but there’s a discussion in the webCoRE community that suggests a problem with ST scheduling and piston executions timing out. That problem started yesterday as well and is continuing this morning.

Pistons with Waits Failing

FYI, I am on the https://graph-na02-useast1.api.smartthings.com/ shard.


(Eric) #3

Sounds consistent with what I see.


#4

I’m curious, what shard are you on—i.e., the URL after you log into the IDE?


(Eric) #5

graph.api.smartthings.com


#6

We have also confirmed that there were at least some changes to scheduling in the ST platform over the past 24 hours, related to a problem that had gone on for a few months. As recently as yesterday morning, any smart app using runIn() to schedule an event more than 25 days in the future was immediately reinvoked with a time event related to that schedule, but as of today those schedules are functioning properly.

Perhaps the solution to that problem has caused some regression for scheduling, or the update introduced general latency issues that cause these timeouts. It did appear in @bthrock’s example that the scheduling could be failing due to a timeout.
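
For reference, here is a rough sketch of where the 25-day figure falls when expressed as a delay in seconds, which is the unit runIn() takes. It is written in Python rather than Groovy, and the helper names (`days_to_seconds`, `was_in_affected_range`) are made up for illustration; the only number taken from this thread is the 25-day threshold.

```python
# Illustrative only: these helpers are hypothetical and not part of any
# SmartThings API. They express the 25-day figure as a delay in seconds.

SECONDS_PER_DAY = 24 * 60 * 60
AFFECTED_THRESHOLD_DAYS = 25  # figure quoted in this thread


def days_to_seconds(days):
    """Convert a delay in days to whole seconds."""
    return int(days * SECONDS_PER_DAY)


def was_in_affected_range(delay_seconds):
    """True if a delay is longer than the 25-day threshold described above."""
    return delay_seconds > days_to_seconds(AFFECTED_THRESHOLD_DAYS)


if __name__ == "__main__":
    for days in (7, 25, 30):
        delay = days_to_seconds(days)
        print(days, "days =", delay, "seconds; affected:", was_in_affected_range(delay))
```

So a schedule set a week out would not have hit the old behavior, while one set 30 days out would have.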

I am also on graph.api.smartthings.com


(Eric) #7

In my case, I see it both with DTHs and smart apps. Using schedule() or runIn(). Not always, but very consistently failing (say 1 in 3 or 1 in 5).


#8

The issue seems to have been resolved, at least for now.


(vlad) #9

We’ve identified an issue with the latest deployment that affected the performance of community smart apps, with the majority of the issues coming from scheduled executions. The update had an impact on the way that smartapps/devicetypes are loaded into memory, and the net effect is that they executed slower because of the increased load, which caused a higher frequency of app timeouts (an execution is canceled if it runs longer than 20 seconds). We’ve made tuning changes to the servers running the apps, and the number of timed-out executions is back down to average frequency.
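
To illustrate the mechanism, here is a minimal sketch (Python, not platform code) of how a fixed 20-second limit turns extra per-execution overhead into more timeouts. The sample durations and the 4-second overhead are made-up numbers, and the function names are hypothetical; only the 20-second limit comes from the description above.

```python
# Illustrative model only: durations and overhead are invented values,
# not measurements from the SmartThings platform.

EXECUTION_LIMIT_SECONDS = 20.0  # limit quoted above


def timed_out(duration_seconds):
    """An execution is canceled if it runs longer than the limit."""
    return duration_seconds > EXECUTION_LIMIT_SECONDS


def timeout_rate(base_durations, load_overhead_seconds=0.0):
    """Fraction of executions that exceed the limit once overhead is added."""
    if not base_durations:
        return 0.0
    hits = sum(1 for d in base_durations if timed_out(d + load_overhead_seconds))
    return hits / len(base_durations)


# Hypothetical execution times in seconds under normal load.
sample = [2.0, 8.0, 15.0, 17.0, 19.0, 21.0]

if __name__ == "__main__":
    print("normal load:", timeout_rate(sample))
    print("with 4 s of extra load time:", timeout_rate(sample, 4.0))
```

With these sample values, one execution in six times out under normal load and three in six do once the overhead is added, which is the kind of jump in frequency reported earlier in the thread.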


(Eric) #10

Thx @vlad
Things do seem much improved.

I am still seeing regular timeouts (but not as frequently).

I will PM you with details.