Scheduled jobs failing (again) (again 😄) (Ongoing Known Issue)

I don’t have an attention span long enough to open the app.

1 Like

:sob:

:dizzy_face:

For the third consecutive week, my periodic history download SmartApp failed to execute.
The only difference is that this time, running it by hand from the IDE, I got an “error” notification in the log:

java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122413839485071 @ line 112

which resulted in the usual IDE return code :

java.lang.reflect.UndeclaredThrowableException @ line 112

I wonder whether the way I wrote lines 111-112 in my “JJ’s Test2” SmartApp somehow results in parallel threads competing for the same DB access and, when they do not get through, timing out.

111    for (tempSens in temperatureSensors) {
112        def allEvents = tempSens.eventsBetween(startDate, endDate, [max:2000]).findAll{it.name == "temperature"}

But if it is the case, how can I modify my code to force serial execution and avoid this concurrent.TimeoutException ?
Sorry if it is a dumb question, but Java/Groovy is not my forte… :frowning:

But obviously, this still does not explain why the same code ran without this problem for 2 years and has systematically failed to execute for the past 3 weeks; somebody at SmartThings must have done (or not done) something…

As far as I know, “eventsBetween()” is just a very expensive function that runs long in “real time” unless the SmartThings database is running optimally. So this is likely a temporary problem.

I don’t think it has anything to do with concurrency or anything you can control … unless you write a convoluted SmartApp that only fetches smaller batches of Events, each with a new SmartApp invocation (using runIn or some other loop callback).
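
To illustrate the "smaller batches" idea, here is a minimal sketch in plain Groovy that only cuts a date range into consecutive windows. The helper name `splitIntoWindows` and the one-day window size are assumptions made for illustration. In a SmartApp, each window would presumably be passed to `eventsBetween()` in its own invocation (scheduled with runIn, as suggested above) instead of querying the whole range at once.

```groovy
// Hypothetical helper: split [start, end) into consecutive windows
// of at most windowMs milliseconds each.
List<Map> splitIntoWindows(Date start, Date end, long windowMs) {
    if (windowMs <= 0) {
        throw new IllegalArgumentException("windowMs must be positive")
    }
    List<Map> result = []
    long cursor = start.time
    long stop = end.time
    while (cursor < stop) {
        // The last window is shortened so it never runs past the end date.
        long next = Math.min(cursor + windowMs, stop)
        result << [from: new Date(cursor), to: new Date(next)]
        cursor = next
    }
    return result
}

// Example: a 7-day range cut into 1-day windows.
long dayMs = 24L * 60 * 60 * 1000
Date rangeEnd = new Date()
Date rangeStart = new Date(rangeEnd.time - 7 * dayMs)
def dayWindows = splitIntoWindows(rangeStart, rangeEnd, dayMs)
println "windows: ${dayWindows.size()}"
```

Smaller windows mean more invocations but less work per query, so the window size is a trade-off to tune against how slow the event queries currently are.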

1 Like

Well, the java.util.concurrent package is clearly used by Groovy and/or SmartApps, since it generates an error message!
The only question is: does this multithreaded execution happen at the lowest DB access level (in which case I cannot do much, and I agree with you), or does it somehow happen at a higher level, meaning that differently written Groovy code with the same semantics could avoid the timeout?
This happens frequently at the SQL level, so I suppose it could also happen when accessing a (supposedly) SQL cloud DB through Groovy and the SmartThings cloud engine.

1 Like

This has been a problem forever. This error hits arbitrarily and causes apps to fail because they take too long to execute (sometimes they time out before they even execute any real code). It hits more often during busy times than during slow times. There is no way to get around it (other than trying a different time). Keep in mind, the failure is random. FYI, this also plagued the ecobees - hitting the ecobee website would sometimes trip the time limit because of the cloud-to-cloud dependencies (it’s much better now, but I suspect they lengthened the time for that specific use case).

For example, this weekend I saw this error for some of my Zigbee bulbs (stock DTH, but even they were failing because of this error).

I wish there was a way to distinguish between apps that are consuming too many resources and running wild versus the ones that exceed the time limit because of platform utilization or cloud dependencies.

2 Likes

@geeji Saw this earlier but have been a little tied up - the issue you are running into is directly related to the current latency problems with the events database. The Simple Device Viewer SmartApp also saw a very large increase in timeouts and implemented a caching/paging workaround: https://github.com/krlaframboise/SmartThings/blob/master/smartapps/krlaframboise/simple-device-viewer.src/simple-device-viewer.groovy
Though I’m not sure how easy it would be to adapt to your scenario. The takeaway is to try to implement paging (request fewer events at a time) and save them into state as a temporary cache. Personally, if I were you I would wait before making those changes, as they would probably require a significant amount of work on the SmartApp side - the latency problems with events are particularly bad, as query times for events are currently a few orders of magnitude higher than normal (they should take microseconds to execute). I don’t want to make any promises (not my place to do so as I’m not the one performing the maintenance on the database) but if all goes well the event latency issues should be resolved by the end of the week. They aren’t really related to user generated load, but they increase depending on the state of the Cassandra Cluster at the time and where it is with Compactions/Garbage Collections/Data streaming for replication.
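
A minimal sketch of the paging-plus-cache idea, written as plain Groovy so it runs outside the platform. Everything here is an assumption for illustration: the `state` map stands in for the SmartApp's persistent state, `fetchPage` is a hypothetical stand-in for a slow event query, and the loop at the end simulates what would be separate scheduled executions on the platform. It is not how Simple Device Viewer implements its workaround.

```groovy
// Stand-in for the SmartApp's persistent `state` (assumption: a plain map here).
def state = [offset: 0, cache: [], done: false]

// Hypothetical data source standing in for a slow event query:
// returns at most `max` items starting at `offset`.
def allItems = (1..25).collect { [name: "temperature", value: 20 + it] }
def fetchPage = { int offset, int max ->
    allItems.drop(offset).take(max)
}

// Fetch exactly one page per call and remember the progress in the state map.
def processNextPage = { Map st, Closure fetch, int pageSize ->
    def page = fetch(st.offset, pageSize)
    st.cache.addAll(page)
    st.offset += page.size()
    // A short (or empty) page means there is nothing left to fetch.
    st.done = page.size() < pageSize
    return st.done
}

// On the platform each call would be its own execution;
// this loop only simulates the repeated invocations.
int invocations = 0
while (!state.done) {
    processNextPage(state, fetchPage, 10)
    invocations++
}
println "invocations=${invocations}, cached=${state.cache.size()}"
```

Because progress lives in the state map rather than in local variables, an execution that fails part-way can resume from the stored offset instead of starting over.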

In regards to the timeout exceptions specifically - there are two flavors…
Soft Timeout: This occurs when a single SmartApp method executes for 20+ seconds.
Hard Timeout: This occurs when a SmartApp executes for 40+ seconds.
It’s not related to multithreading, apart from the fact that this is the exception we raise when a SmartApp has gone over those limits, in order to cause an interrupt in the sandbox, which runs in its own thread.
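
Given the 20-second soft limit described above, one defensive pattern is to stop looping once a self-imposed time budget is used up and resume later. The sketch below is plain Groovy under stated assumptions: `processWithinBudget` is a hypothetical helper, the 15-second budget is an arbitrary safety margin, and `System.currentTimeMillis()` is used so the snippet runs anywhere (inside a SmartApp the platform's `now()` would likely take its place). It cannot interrupt a single call that is itself slow, such as one long `eventsBetween()` query; it only avoids starting more work once the budget is spent.

```groovy
// Hypothetical helper: process items until the list is exhausted or the
// time budget is used up. Returns the index of the first unprocessed item.
int processWithinBudget(List items, long budgetMs, Closure work) {
    long deadline = System.currentTimeMillis() + budgetMs
    int index = 0
    while (index < items.size() && System.currentTimeMillis() < deadline) {
        work(items[index])
        index++
    }
    return index
}

// Example: three trivial work items with a 15-second budget.
def processed = []
int nextIndex = processWithinBudget(['a', 'b', 'c'], 15000L) { processed << it }
println "processed ${processed.size()} items, resume at index ${nextIndex}"
```

The returned index can be saved and used as the starting point of the next execution.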

@bago I do see why the errors can seem random (to the user they might as well be) but I do want to mention that we monitor timeouts and I haven’t run into an instance where they were actually “random” - generally the issues fall into the following categories:

  1. GC overhead causing an increase in timeouts due to garbage collection pauses (we monitor this and have actually released a mitigation for scheduled executions which are most affected by this case especially at common execution times)
  2. Database Latency (we monitor this and are working on a fix but this is where the vast majority of the current timeouts are occurring)
  3. HTTP Requests (This can be random in a sense that we’re not monitoring APIs for response time and alerting on them but you can get around this by using the new async http framework that is in beta - though you can still run into connection timeout issues that are out of our control). One thing I will mention is that as SmartApps are migrating to the async model I am seeing timeouts drop dramatically.
  4. Bad SmartApp Code (Infinite loops, stack overflows, etc…)
7 Likes

@vlad. Thank you. Always good to hear your explanations.

I hope the errors subside.

Is there any way to monitor for these and advise 1) the SmartApp authors, and 2) the Users if it doesn’t change?

1 Like

There are definitely cases where that happens - we have reached out to SmartApp authors before and followed up with requests to update apps. This has gone the other way around as well where a SmartApp author has reached out to us to help identify users with a bad version of their app so we could request them to update (though I don’t think there is a formal process in place for that afaik).

For most things like infinite loops/stack overflow errors we could have better monitoring but it really is a priority issue - if it’s a custom SmartApp and only a few people are using it, chances are it’s not going to show up on monitoring at all (Please don’t take that as an invitation to prove me wrong >_>). I probably should’ve left a note on that line that said - this doesn’t affect other users as much as you would think. This also goes back to the whole rate limiting discussion… we are usually over-provisioned with servers to handle load and have constant load tests running on all environments, which are “usually” the highest load driving users that are out there. For a single user (or a few users) to slow things down there is usually some underlying issue that is exposed by their SmartApp that we haven’t accounted for. You have to do some crazy stuff (or hit a crazy bug) to hit our rate limits, which I think are 250 executions per minute? (I am also speaking after months and months of performance/caching improvements though - so the landscape now is way different from what it was a year ago).

Very rarely there are server killers out there which are… interesting to track down like when this behavior was triggered in the sandbox that generated GB sized stack traces a few times before we caught it: https://issues.apache.org/jira/browse/GROOVY-7941 (double checked this was actually fixed before linking it :slight_smile: ) Our metrics/monitoring systems are fairly good though and we can generally catch runaway apps like that quickly and put a stop to them.

We also have a group of guys (@Aaron & a few others) who discussed this exact subject in the dev call a few calls ago to talk about what metrics we look at for this stuff.

TL;DR yes

4 Likes

@vlad - First I just want to say that I appreciate you participating in this thread.

That said, this thread is 25 days old and the issue on the SmartThings status website is 10 days old.

Unfortunately this seems normal for SmartThings but would be unacceptable for basically any other system (HA or other).

When can we get a realistic idea of when this will be fixed? I am struggling to understand why this is taking so long unless there is no urgency around fixing the issue.

There’s definitely urgency … it’s just a very difficult problem.

Then again; I haven’t had any scheduling problems recently. Perhaps there are mitigating factors, and so I have to wonder exactly what proportion of SmartThings Customers are affected!

10-25 days is a looooong time.

4 years is even longer.

https://www.kickstarter.com/projects/smartthings/smartthings-make-your-world-smarter/rewards

You’re right. It’s a shame.

1 Like

@vlad Thanks a lot for those detailed explanations, it is a very good thing some SmartThings staff monitors those (mostly complaints) forums.
It is most frustrating when your SmartApps don’t work anymore for some cloud-related unknown reason and when ST support throws you away because of “custom SmartApp, not our problem”.

I hope you are not overly optimistic, and that those timeouts will actually disappear soon.
In the meantime, I will take your advice and not completely redesign my SmartApp, which in addition to the time involved could also make me stumble over some other unsuspected SmartThings glitch… :frowning:

The issue from 25 days ago was: Scheduled jobs failing (again) (again 😄) (Ongoing Known Issue), which was a major issue overnight that was mitigated the next day by rolling restarts and resolved a few days later after we put out new JVM settings (and followed up with further updates to address memory usage in general).

This specific incident started 10 days ago - the status update was made the morning after the event save timeouts began occurring. There is urgency around fixing this but yes, it is a very difficult problem… Downtime for the event ring = downtime for the platform.

@tgauchat The places you can run into issues right now are when a SmartApp or DTH reads/writes events. The more that occurs the higher chance of failure (SHM, CoRE, Simple Device Viewer, etc…).

5 Likes

Thanks for the update, Vlad.

  • From a Community Developer perspective, the details matter and I appreciate them.
  • From a Customer perspective, it’s harder to distinguish an old fixed problem from a new one, even if isolated, if the same Customer is “impacted”.
3 Likes

I’m still seeing some random failures, mostly in my notification logs and recent activity logs. Over the past two days, though, I have noticed that my “I’m Back” routine has not run. Presence does pick me up as home when I arrive, but “I’m Back” doesn’t execute and I’ve been setting off intrusion detection on the front door.

Same here. I think it’s related to what Vlad mentioned a few posts back. It’s not an easy fix, so no timeline has been mentioned (AFAIK).

1 Like