Unschedule() API Execution Time Too Long?


(Geko) #1

This is a question for SmartThings software developers.

During smart app development I’ve run into a problem when one of my functions occasionally triggered java.util.concurrent.TimeoutException which means that it’s exceeded 20 sec execution limit.

With some debugging I was able to narrow it down to calling the unschedule() API. A simple test revealed that executing unschedule() API can take anywhere from 90 milliseconds to 3.7 seconds (!). And that’s on a good day. Occasionally, unschedule() gets stuck for longer than 20 seconds, resulting in java.util.concurrent.TimeoutException.

Here’s a simple test that I ran:

private def myUnschedule() {
    def t0 = now()
    unschedule()
    def t = now() - t0
    log.trace "unschedule() is executed in ${t} ms"
}

And here’s what it prints:

XXXX 10:50:46 AM PST: trace unschedule() is executed in 1420 ms
XXXX 10:50:45 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 10:50:42 AM PST: trace unschedule() is executed in 3652 ms
XXXX 10:50:38 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 10:50:35 AM PST: trace unschedule() is executed in 3744 ms
XXXX 10:50:31 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 10:50:28 AM PST: trace unschedule() is executed in 2383 ms
XXXX 10:50:25 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX

A related question is what happens to the smart app once it hits TimeoutException? I have a suspicion that it’s left in a “zombie” state where it does not properly responds to events and can no longer update its state.

Any help resolving this issue would be greatly appreciated.

P.S.

Here’s another test tun: unschedule() exceeds 10 seconds (!)

XXXX 11:26:17 AM PST: trace unschedule() is executed in 5045 ms
XXXX 11:26:12 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 11:26:09 AM PST: trace unschedule() is executed in 10467 ms
XXXX 11:25:58 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 11:25:53 AM PST: trace unschedule() is executed in 5153 ms
XXXX 11:25:48 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 11:25:16 AM PST: trace unschedule() is executed in 8093 ms
XXXX 11:25:08 AM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX

… and another one: unschedule() exceeding 15 seconds!

XXXX 12:15:08 PM PST: trace unschedule() is executed in 12704 ms
XXXX 12:14:56 PM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 12:15:08 PM PST: trace unschedule() is executed in 14354 ms
XXXX 12:14:54 PM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 12:15:08 PM PST: trace unschedule() is executed in 15502 ms
XXXX 12:14:53 PM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 12:14:51 PM PST: trace unschedule() is executed in 887 ms
XXXX 12:14:51 PM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX


"Pollster" - A SmartThings Polling Daemon
Failed Routines - possible explanation revealed by live log
Turn on at sunrise or sunset not working
Execution times increasing
Smart Alarm is here
Why can't SmartThings get simple schedules right?
"Pollster" - A SmartThings Polling Daemon
(Geko) #2

Well… I’ve added some code to keep track of maximum execution time and it didn’t take long to hit 20 seconds mark.

if (t > state.t_max) {
    state.t_max = t
    log.trace "state: ${state}"
}

XXXX 1:06:53 PM PST: trace state: [t_max:20086]

So, what’s the deal with unschedule() taking more than 20 seconds to execute? An what is the implication for the Smart Apps that got busted by the TimeoutException?

P.S.

Ok. I managed to capture TimeoutException in the log:

XXXX 1:21:56 PM PST: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 882354304390580 @ line 61
XXXX 1:21:56 PM PST: trace Zombie Test was provided…creating subscription
XXXX 1:21:55 PM PST: trace Scheduling ‘zombieTest’ for InstalledSmartApp: XXXX
XXXX 1:21:46 PM PST: trace Deleting scheduled job ‘zombieTest’ for InstalledSmartApp: XXXX
XXXX 1:21:45 PM PST: trace state: [t_max:19066]
XXXX 1:21:45 PM PST: trace unschedule() is executed in 19066 ms
XXXX 1:21:27 PM PST: trace Deleting all scheduled jobs for InstalledSmartApp: XXXX
XXXX 1:21:26 PM PST: trace initialize with settings: [:]


(Geko) #3

Opened support ticket:

Support request #86683: Smart App TimeoutException: Execution time exceeded 20 app execution seconds


(Geko) #4

New execution time record: 27.7 seconds!

8:03:57 PM PST: trace state: [t_max:27741]

UPDATED 2/16/2015
Another record: 32.1 seconds!

5:16:28 PM PST: trace state: [t_max:32148]

This looks like a serious issue to me. Would appreciate any comments.


(Geko) #5

Got response form support suggesting that I get @Ben and @mager involved in this discussion. So there it is…


(Gary D) #6

I’ve got similar responses from support@ if it even hints at dev related. You’d think its be easier for a ST employee to forward an email or ticket, but maybe not. (Sadly, getting someone to read a forum post can be… Difficult )

I hope you get this one addressed. If not, perhaps bring it up at the Dev conference call next week…

Take care
Gary


(Ben Edwards) #7

We’re getting the right people to look at this thread. Hang in there.


(Matt Nohr) #8

I totally agree that calling unschedule() takes a long time. In the past few minutes as I’ve been looking at this thread, the average time to unschedule was right around 3 seconds, which is close to what you were seeing. I did see some very fast ones too, around 4 ms, but that didn’t seem to be the normal. Even with 3 seconds, that is a good chunk of the 20 second total allowed, and if it slows down I’m not surprised you were seeing the TimeoutExceptions.

Unschedule actually has to do a number of things like finding all the jobs for this installation of the SmartApp and cleaning up data in a number of (very large) tables. I don’t have the data to prove it, but I assume the problem is getting worse over time.

For now, I know that doesn’t help much, and I’m sorry. Soon (hopefully) we’ll be rolling out a new scheduling mechanism behind the scenes that should be better able to handle this with much larger volumes of customers.

I’ve asked another SmartThings developer to chime in on this part. I think the SmartApp just gets shut down at the point of the timeout. It will run again if there are more schedules or subscriptions. In this case I think the unschedule() should finish, even if the exception is thrown, but I’ll let me colleague correct me if I’m wrong.


ST app not loading at all now?
(Aaron) #9

I wonder if this could somehow result in a situation where the platform has reported that a handler has run when, in fact, it didn’t. I’ve been trying to troubleshoot a failure here without much success Maybe what’s described here is an explanation.


(Geko) #10

I appreciate your looking into this issue. I monitored execution time over several hours period and the shortest execution time was about 75 ms, while the longest so far was 42 seconds (!) Typically, I see at least one occurrence exceeding 10 seconds within an hour.

My concern is that there’s a significantly high probability of failure for any app that uses unschedule() API. Is there any workaround you could recommend until it’s fixed on your end?


(Geko) #12

Please do not turn this thread into another architectural discussion. You’re welcome to spawn a new thread if you wish to go in that direction. Thanks.


(Gene Lerman) #13

This may explain spastic problems I’ve encountered with @RBoy “5-2 Day Thermostat” app where the scheduled temperature change does not occur. The initialize() function appears to be called at the scheduled time but the temperature setpoint change don’t occur. The first thing initialize() does is call unschedule(). Later, when there is a mode change, the app appears to function correctly again. That behavior is consistent with the expected behavior described by @matthewnohr .
I’m also interested in the workaround.


(Geko) #14

Just FYI, I also ran similar test measuring execution time of runOnce API and got similar results. Typical execution time is less than 1 second, but there are frequent spikes (several times per hour) when execution time exceed 10 and even 20 seconds.

I got 6 occurrences exceeding 20 seconds within the last 4 hours:

16:40 PM : 43.4 seconds
16:41 PM : 21.4 seconds
16:46 PM : 20.6 seconds
17:35 PM : 20.1 seconds
17:41 PM : 37.9 seconds
17:42 PM : 37.3 seconds

(Brian S. Lowrance) #15

I ran into this with my BigTalker SmartApp just now when settings were updated, I would get the above exception after hitting Done. Sometimes after 20 seconds, sometimes after 40 seconds. The delay came in when calling unschedule() or even when calling schedule(scheduleTime, onSchedule). It was not consistent though. It seemed very random, but did it often when updating settings.

For me, I adjusted my initialize() function to only do a couple of minor things (setting a couple of variables) and then it calls runIn(10, initSubscribe) to schedule actually running subscribe and schedule functions seconds later to do the heavy lifting outside of the initialize() function. This allowed the settings to be updated without throwing the exception in the log and red errors in the mobile app when hitting Done. It also released the mobile app interface so that it was no longer delaying and spinning/waiting on subscribe()'s and schedule()'s to finish processing. After doing this, when the runIn(10, initSubscribe) fired (10 seconds after being called from within initialize()), the SmartApp completed subscribing to events and scheduling jobs in 2 seconds. In a second test, it completed initSubscribe() in 9 seconds. In a third test, it completed initSubscribe() in 4 seconds. Either way, the user experience is much better this way as they do not have a long delay in saving preference changes and so far not getting the exception and red app errors.

This makes me wonder if subscribe(), schedule() being called too soon (without at least a delay like the 10 seconds I introduce in my method) after unsubscribe(), unschedule() may be the culprit. Spacing them out with runIn() seems to have resolved my issue at least.

5:27:28 PM: trace BIGTALKERDEV(1.0.3-Beta6) || Updated with settings 5:27:28 PM: trace Deleting all scheduled jobs 5:27:28 PM: trace Big Talker Dev is attempting to unsubscribe from all events 5:27:29 PM: trace BIGTALKERDEV(1.0.3-Beta6) || Initialized 5:27:39 PM: debug BIGTALKERDEV(1.0.3-Beta6) || Begin initSubscribe() ... 11 device subscriptions ... (truncated) 5:27:40 PM: trace Scheduling 'onSchedule1Event' 5:27:40 PM: trace Scheduling 'onSchedule2Event' 5:27:41 PM: trace Scheduling 'onSchedule3Event' 5:27:41 PM: trace Home was provided...creating subscription 5:27:41 PM: debug BIGTALKERDEV(1.0.3-Beta6) || END initSubscribe()

@Geko @Ben @mager


(Brian S. Lowrance) #16

I ended up still having trouble with the timeout, so I scheduled two runIn() functions within initialize(). One for initSubscribe() (currently at 20 seconds after initialize()) and one for initSchedule() 10 seconds later as it seems often scheduling takes too long and would throw the TimeoutException in the IDE. Perhaps by calling Scheduling within it’s own function it will not breach the timeout period.

EDIT: This is not working either. It worked for the first log below, but see the runIn() failure in the second and failure after schedule() in the third log. It just takes way too long to schedule() and initSubscribe() wasn’t event executed in the second log below. I continuously had problems with runIn(), runOnce() or even trying to do it with schedule(now()+60000, function); it would sometimes run and sometimes not as seen in log #2 below.

Success

7:48:27 PM: trace BIGTALKERDEV(1.0.3-Beta6) || Updated with settings: …
7:48:27 PM: trace Big Talker Dev is attempting to unsubscribe from all events
7:48:27 PM: trace Deleting all scheduled jobs
7:48:32 PM: debug BIGTALKERDEV(1.0.3-Beta6) || Scheduled initSubscribe() in 20 seconds
7:48:52 PM: debug BIGTALKERDEV(1.0.3-Beta6) || BEGIN initSubscribe()
…bunch of subscriptions…
7:48:52 PM: debug BIGTALKERDEV(1.0.3-Beta6) || END initSubscribe()
7:49:02 PM: debug BIGTALKERDEV(1.0.3-Beta6) || BEGIN initSchedule()
7:49:02 PM: trace Scheduling 'onSchedule1Event’
7:49:06 PM: trace Scheduling 'onSchedule2Event’
7:49:08 PM: trace Scheduling 'onSchedule3Event’
7:49:10 PM: debug BIGTALKERDEV(1.0.3-Beta6) || END initSchedule()


FAIL

8:00:11 PM: trace BIGTALKERDEV(1.0.3-Beta6) || Updated with settings
8:00:11 PM: trace Big Talker Dev is attempting to unsubscribe from all events
8:00:11 PM: trace Deleting all scheduled jobs
8:00:17 PM: debug BIGTALKERDEV(1.0.3-Beta6) || Scheduled initSubscribe() in 20 seconds
*initSubscribe() never executed!!!
8:00:47 PM: debug BIGTALKERDEV(1.0.3-Beta6) || BEGIN initSchedule()
8:00:47 PM: trace Scheduling 'onSchedule1Event’
8:00:48 PM: trace Scheduling 'onSchedule2Event’
8:00:51 PM: trace Scheduling 'onSchedule3Event’
8:00:55 PM: debug BIGTALKERDEV(1.0.3-Beta6) || END initSchedule()


FAIL

8:05:28 PM: trace Big Talker Dev is attempting to unsubscribe from all events
8:05:28 PM: trace Deleting all scheduled jobs
8:05:34 PM: debug BIGTALKERDEV(1.0.3-Beta6) || Scheduled initSubscribe() in 20 seconds
8:05:54 PM: debug BIGTALKERDEV(1.0.3-Beta6) || BEGIN initSubscribe()
…bunch of subscriptions…
8:05:54 PM: debug BIGTALKERDEV(1.0.3-Beta6) || END initSubscribe()
8:06:04 PM: debug BIGTALKERDEV(1.0.3-Beta6) || BEGIN initSchedule()
8:06:04 PM: trace Scheduling 'onSchedule1Event’
8:06:19 PM: trace Scheduling 'onSchedule2Event’
8:06:23 PM: trace Scheduling 'onSchedule3Event’
8:06:27 PM: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds

The odd thing is, Schedule3 is scheduled, but then it seems to hang until it throws the error at the 8:06:27 PM mark. The only thing that should occur after the Schedule3 is:
LOGDEBUG (“END initSchedule()”)
LOGTRACE(“Initialized”)
sendNotification(“BIGTALKER: Settings activated”)
}

LOGDEBUG() and LOGTRACE() only call out to log.debug() and log.trace() and prepend some info which you can see working in the logs above.
sendNotification() is there so that I/the User would know when the settings actually took affect since I couldn’t save them successfully directly within initialize().
But all of these functions existed and were called in the successful logs as well.

Scheduling is very important but it is a mess, it seems.


(Geko) #17

Yep. I tried all of these. It’s a chicken and egg problem. You try to defer initialization, but you have to call runIn or runOnce to schedule it. And since either of those can cause a timeout, you just move you point of failure to a different function.


(Jim Anderson) #18

Our most recent update should alleviate some of the performance issues with unschedule. Let us know if you’re seeing improved performance, or if you’re still seeing the same issues.


(Marc) #19

Apparently this is still an issue according to @geko as my Pollster application crashes once in awhile. When can we expect a fix?