Weekly Update from Alex - 07/23/16

alex · July 23, 2016, 6:37pm

I want to use this week’s post to give an update on our Grails upgrade efforts and full transparency into our process. Two weeks ago we announced a big change coming to our platform: an upgrade of our Grails version. We are still planning to upgrade to Grails 2.5.4, but we deployed two test releases to some of our production servers and noticed some problems right away. We rolled back those changes immediately and are evaluating what caused the behavior to differ from what we saw in our staging environment.

One thing we have to remember when we make big changes like this is that development and staging environments are not production. While we do our best to keep the staging and development environments in sync, there will inevitably be differences between production and our testing environments, the biggest of which is the traffic and the number of people using them. This is why pushing to a few preliminary servers was so important. I am happy to see that our testing paradigm worked and we were able to catch the problem before we pushed the update to all of our production environments.

We are committed to making sure this change is broadly compatible, and we will only re-launch the upgrade once we are confident it offers the broadest compatibility we can provide.

All that being said, we do recognize there have been some problems with scheduled apps timing out. I want to assure you this isn’t a problem with the scheduler itself. In every case we have seen with this type of behavior, the scheduler worked as intended and sent the task to the SmartApp to run. We took a look at all the cases and believe we found a root cause for a lot of these instances. On Wednesday we pushed an update to the platform that we believe will go a long way toward resolving the SmartApp timeouts.

I also want to let you know about a new feature we added to the platform. On Thursday we released a small update to the platform and we added a new feature for SmartApps that utilize OAuth. This feature will add a layer of security to your Web Service SmartApps (they are all OAuth) by supporting redirect URI validation when making the OAuth authorization code request. Documentation on this new feature can be found here.
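To illustrate what redirect URI validation means in general, here is a minimal sketch in Python. It is not the platform's implementation: the function name `is_redirect_uri_allowed` and the example URIs are hypothetical, and it assumes the common approach of requiring an exact match against the URIs registered for the client, with absolute URIs only and no fragment, as the OAuth 2.0 specification requires of redirection endpoints.

```python
from urllib.parse import urlsplit


def is_redirect_uri_allowed(requested_uri, registered_uris):
    """Return True only if requested_uri exactly matches a registered URI.

    Hypothetical helper for illustration; not the SmartThings API.
    """
    parts = urlsplit(requested_uri)
    # A redirect URI must be absolute (it needs a scheme) and carry no fragment.
    if not parts.scheme or parts.fragment:
        return False
    # Exact string comparison: no prefix or wildcard matching.
    return requested_uri in set(registered_uris)


# Example registration list (assumed values).
registered = ["https://example.com/oauth/callback"]
print(is_redirect_uri_allowed("https://example.com/oauth/callback", registered))
print(is_redirect_uri_allowed("https://evil.example.net/oauth/callback", registered))
```

The point of the check is that an authorization code is only ever sent to a URI the app owner registered in advance, so a tampered `redirect_uri` parameter cannot divert it elsewhere.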

We are continuing to strive to make this platform the greatest Smart Home platform on the planet. I hope everyone has a great weekend and I’ll see you all again next week!

-Alex

RBoy · July 23, 2016, 7:58pm

Does this mean the issue has been fixed, or is there more still to be done?

bobbles · July 23, 2016, 8:08pm

Was this an update to the hub or the servers?
Just wondering, as my hub’s firmware version hasn’t changed.

RBoy · July 23, 2016, 9:00pm

Just as an FYI, it didn’t execute the scheduled method about 27 minutes ago. Like you said, the scheduler says it “called” the function, as your team can see below, but it didn’t execute:

initialize 2016-07-23 4:30:00 PM EDT 2016-07-23 4:30:00.509 PM EDT 509 60671

This is just a simple thermostat app, so it isn’t even heavy. All it does is set the thermostat temperature, and that didn’t happen at 4:30 EDT today. I had to manually go in and “update” the app for it to execute.

slagle · July 23, 2016, 11:15pm

We’re still digging into things. We’ll give updates accordingly.

an39511 · July 24, 2016, 11:58am

As of 2016-07-23 I still have timeouts. Yesterday I had a minimum of 8 failures. Most of the failures appear to bunch up in the morning and more often in the evening towards the end of the day. So if this fix has been in place for several days, it ain’t fixed. From my viewpoint, it has everything to do with SYSTEM LOAD!

bridaus · July 24, 2016, 12:14pm

I discovered my two wake routines were not running the past few days. Sent a detailed note to support that I hope makes it to engineering. Every time a hub update is pushed, you have to go reset some/all of your routines. This obviously should not be necessary.

I’m considering moving all my routines into CoRE, but I’d prefer to use ST as intended. I await their response.

JH1 · July 24, 2016, 2:20pm

I have mixed feelings about this. For you, specifically, I think it would be a valuable experiment to not make changes to your system for 6 months. See how far it degrades.

However, the value would be for ST and the community. For you… probably not so much.

bridaus · July 24, 2016, 2:29pm

You’re right, don’t think I can do that. When I have time I tinker. The best I can do is what I do now, try to use ST as it’s intended, and report issues to support with facts when it doesn’t have to do with my tinkering. While I have been changing some small things lately (mostly adding fun CoRE rules and removing RM), I have not done anything grand.

The fact is, this particular problem I just had, I am very aware of, it’s consistent, it’s happened after other hub updates. I already specifically leave things alone that I can and wait for things to fail before I do anything because the expected behavior is that a hub update should not require manual intervention by the user.

I’d suggest that ST have a hub on every production shard with a good set of routines/apps running with expected outputs using a mixture of SL, Routines, and (gasp) the most popular community apps. Copies of typical customer setups. I feel they would often (but not always) be able to trap the same things we do.

Lastly, due to my design this was very inconsequential to my household (no one noticed but me). I have notifications that go to me when things aren’t right, and I pressed one button to fix the issue. Yeah, I shouldn’t have had to spend five seconds fixing the problem, but it was five seconds in the scheme of things.

SBDOBRESCU · July 24, 2016, 3:25pm

Why do you assume otherwise? As far as I know, every member has his/her own setup at home and they install pre-release updates. Ask @slagle about it. Just because one person’s experience is different from another’s doesn’t mean that one is more cautious than the other, only luckier or more resilient to minor issues. I, for one, never had to update my routines, or at least not that I can remember. Even those that fail once in a while came back on their own. Like @JDRoberts says, failures could happen to anyone!

alex · July 24, 2016, 4:32pm

I am asking some of our engineers to look into your issues and others cited on this thread directly so we can make sure we are getting to the absolute root. Thanks!

bridaus · July 24, 2016, 4:36pm

True, I don’t know. I do believe my routines would eventually heal themselves, but they had not yet.

an39511 · July 24, 2016, 8:27pm

Here is another snapshot of the end result of the problem.

Aaron · July 25, 2016, 3:19am

If you haven’t already, can you email this over to support@ (if you have, please DM me the ticket number).

Thanks!

RBoy · July 25, 2016, 4:27pm

Here’s a new one today (and I had a very bad feeling about this one):
Since this morning my two lock apps have been throwing these errors:

523b6f2f-6068-48c2-a86c-d18befe5521c 12:23:29 PM: error java.lang.NullPointerException: Cannot invoke method contains() on null object

Took me 15 minutes to figure out that this line was throwing the error:

state.expiredLockList.contains(lock.id)

The state variable dropped the expiredLockList variable. As in, it disappeared. I had to reinitialize the app again to get the variable back. This is beginning to look like a rerun of Rule Machine.
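A defensive guard would not explain why the variable vanished, but it would keep the app from throwing when it does. Here is a minimal sketch of the idea in Python (SmartApps themselves are written in Groovy, where the safe-navigation operator `?.` plays a similar role). A plain dict stands in for the app's state map, and `is_lock_expired` is a hypothetical helper, not part of the app:

```python
def is_lock_expired(state, lock_id):
    """Check membership without assuming the list survived in state.

    `state` is a plain dict standing in for the SmartApp state map
    (an assumption made for illustration).
    """
    expired = state.get("expiredLockList")
    if expired is None:
        # The key was dropped or never initialized: recreate it
        # instead of calling a method on a null object.
        expired = []
        state["expiredLockList"] = expired
    return lock_id in expired


state = {}  # simulates state after the variable disappeared
print(is_lock_expired(state, "lock-1"))
```

Note that this only masks the symptom: any entries that were in the list before it was dropped are still lost.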

JDRoberts · July 25, 2016, 4:32pm

Several people reporting problems with State:

http://status.smartthings.com has been updated:

an39511 · July 25, 2016, 4:46pm

Not that it matters much, but the few rules I have left in Rule Machine all went headless this morning. It is clear to me that the SmartThings platform has suffered anything from data contamination to outright data loss.

What is discouraging is these problems are old and never solved. The failing of timed events and now this tells me nothing has changed. The old problems are still under the surface. My speculation is this is all being caused by an overloaded platform. I pray that it doesn’t degrade to the point it was the last time.

SBDOBRESCU · July 25, 2016, 5:02pm

In case you didn’t see this:

Benji · July 25, 2016, 6:17pm

For reference -

Platform Update = Update to ST’s backend servers (the ST “Cloud”).
Hub/Firmware Update = Update to your physical ST hub in your home.
App Update = Update to the iOS/Android app. Maybe the Windows Mobile app too, but that’s wishful thinking.

an39511 · July 25, 2016, 6:43pm

I have lost nearly half of my CoRE apps. This is REALLY REALLY BAD.

Is the problem supposed to be fixed? Are they still trying to restore the lost data?

I got a real mess on my hands at the moment.
