Weekly Update from Alex - 07/23/16

I want to use this week’s post to update you on our Grails upgrade and to be fully transparent about our process. Two weeks ago we announced a big change coming to our platform: an upgrade of our Grails version. We are still planning to upgrade to Grails 2.5.4, but when we deployed two test releases to some of our production servers we noticed problems right away. We rolled back those changes immediately and are evaluating what caused behavior that differed from our staging environment.

One thing we have to remember when we make big changes like this is that development and staging environments are not production. While we do our best to keep the staging and development environments in sync, there will inevitably be differences between production and our testing environments, the biggest of which is the traffic and the number of people using them. This is why pushing to a few preliminary servers was so important. I am happy to see that our testing paradigm worked and we were able to catch the problem before we pushed the update to all of our production environments.

We are committed to making sure this change has broad compatibility, and we will only relaunch the upgrade once we are confident it offers the broadest compatibility we can.


All that being said, we do recognize there have been some problems with scheduled apps timing out. I want to assure you this isn’t a problem with the scheduler itself. In every case we have seen with this type of behavior, the scheduler worked as intended and sent the task to the SmartApp to run. We took a look at all the cases and believe we found a root cause for many of them. On Wednesday we pushed an update to the platform that we believe will go a long way toward resolving the SmartApp timeouts.

I also want to let you know about a new feature. On Thursday we released a small platform update that adds a feature for SmartApps that use OAuth. It adds a layer of security to your Web Service SmartApps (they all use OAuth) by supporting redirect URI validation when making the OAuth authorization code request. Documentation on this new feature can be found here.
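
For anyone unfamiliar with the idea, redirect URI validation generally means the authorization request is accepted only if its redirect URI matches one registered in advance. The following is a minimal Groovy sketch of that general idea, not the platform's actual implementation; the names `registeredRedirectUris` and `isRedirectUriAllowed` and the example URLs are hypothetical.

```groovy
// Hypothetical set of redirect URIs a client registered in advance
// (assumption: the real platform stores these with the OAuth client).
def registeredRedirectUris = [
    "https://example.com/oauth/callback"
] as Set

// Hypothetical check: allow the request only on an exact string match.
def isRedirectUriAllowed = { String requested ->
    requested != null && registeredRedirectUris.contains(requested)
}

// A registered URI is allowed; anything else (or a missing URI) is not.
assert isRedirectUriAllowed("https://example.com/oauth/callback")
assert !isRedirectUriAllowed("https://attacker.example/callback")
assert !isRedirectUriAllowed(null)
```

Exact matching is used here only because it is the simplest rule to reason about; the real feature may apply different matching rules, so check the documentation.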

We are continuing to strive to make this platform the greatest Smart Home platform on the planet. I hope everyone has a great weekend and I’ll see you all again next week!

-Alex

Does this mean the issue has been fixed, or is there more still to be done?

Was this an update to the hub or the servers?
Just wondering, as my hub’s firmware version hasn’t changed.

Just as an FYI, it didn’t execute the scheduled method about 27 minutes ago. Like you said, the scheduler says it “called” the function, as your team can see below, but it didn’t execute:

initialize 2016-07-23 4:30:00 PM EDT 2016-07-23 4:30:00.509 PM EDT 509 60671

This is just a simple thermostat app, so it isn’t even heavy. All it does is set the thermostat temperature, and that didn’t happen at 4:30 EDT today. I had to manually go in and “update” the app for it to execute.

We’re still digging into things. We’ll give updates accordingly.

As of 2016-07-23 I still have timeouts. Yesterday I had a minimum of 8 failures. Most of the failures appear to bunch up in the morning and more often in the evening towards the end of the day. So if this fix has been in place for several days, it ain’t fixed. From my viewpoint, it has everything to do with SYSTEM LOAD!

I discovered my two wake routines were not running the past few days. Sent a detailed note to support that I hope makes it to engineering. Every time a hub update is pushed, you have to go reset some/all of your routines. This obviously should not be necessary.

I’m considering moving all my routines into CoRE, but I’d prefer to use ST as intended. I await their response.

I have mixed feelings about this. For you, specifically, I think it would be a valuable experiment to not make changes to your system for 6 months. See how far it degrades.

However, the value would be for ST and the community. For you… probably not so much.

You’re right, don’t think I can do that. When I have time I tinker. The best I can do is what I do now, try to use ST as it’s intended, and report issues to support with facts when it doesn’t have to do with my tinkering. While I have been changing some small things lately (mostly adding fun CoRE rules and removing RM), I have not done anything grand.

The fact is, I am very aware of this particular problem. It’s consistent, and it has happened after other hub updates. I already leave alone whatever I can and wait for things to fail before I do anything, because the expected behavior is that a hub update should not require manual intervention by the user.

I’d suggest that ST have a hub on every production shard with a good set of routines/apps running with expected outputs using a mixture of SL, Routines, and (gasp) the most popular community apps. Copies of typical customer setups. I feel they would often (but not always) be able to trap the same things we do.

Lastly, due to my design this was very inconsequential to my household (no one noticed but me). I have notifications that go to me when things aren’t right, and I pressed one button to fix the issue. Yeah, I shouldn’t have had to spend five seconds fixing the problem, but it was five seconds in the scheme of things.

Why do you assume otherwise? As far as I know, every member has his/her own setup at home and they install pre-release updates. Ask @slagle about it. Just because one person’s experience is different from another’s doesn’t mean that one is more cautious than the other, only luckier or more resilient to minor issues. I, for one, have never had to update my routines, or at least not that I can remember. Even the ones that fail once in a while came back on their own. Like @JDRoberts says, failures could happen to anyone!

I am asking some of our engineers to look into your issues and others cited on this thread directly so we can make sure we are getting to the absolute root. Thanks!

True, I don’t know. I do believe my routines would eventually heal themselves, but they had not yet.

Here is another snapshot of the end result of the problem.

If you haven’t already, can you email this over to support@ (if you have, please DM me the ticket number).

Thanks!

Here’s a new one today (and I had a very bad feeling about this one):
Since this morning my two lock apps have been throwing these errors:

523b6f2f-6068-48c2-a86c-d18befe5521c 12:23:29 PM: error java.lang.NullPointerException: Cannot invoke method contains() on null object

It took me 15 minutes to figure out that this line is throwing the error:

state.expiredLockList.contains(lock.id)

The state variable dropped the expiredLockList variable. As in, it disappeared. I had to reinitialize the app to get the variable back. This is beginning to look like a rerun of Rule Machine :frowning:
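
One way to keep a missing state entry from crashing the app, sketched in plain Groovy. This is only a guard against the NullPointerException, not a fix for the lost data. The `state` map below is a stand-in for the SmartApp's platform-provided `state`, and the helper name `isExpired` is hypothetical.

```groovy
// Stand-in for the SmartApp's persistent `state` (assumption: in a real
// SmartApp the platform provides this object; here it is an ordinary map).
def state = [:]

// Hypothetical helper: re-creates the list if it has gone missing,
// then checks whether the given lock id is in it.
def isExpired = { Map st, String lockId ->
    if (st.expiredLockList == null) {
        st.expiredLockList = []   // restore an empty list instead of throwing
    }
    st.expiredLockList.contains(lockId)
}

// With the list missing, the check returns false rather than throwing.
assert !isExpired(state, "lock-1")

// Once an id is recorded, the same check finds it.
state.expiredLockList << "lock-1"
assert isExpired(state, "lock-1")
```

A shorter alternative is Groovy's safe-navigation operator, `state.expiredLockList?.contains(lock.id)`, which evaluates to null (treated as false in a condition) when the list is missing, although it does not restore the list.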

Several people are reporting problems with State:

http://status.smartthings.com has been updated:

Not that it matters much, but the few rules I have left in Rule Machine all went headless this morning. It is clear to me that the SmartThings platform has suffered anything from data contamination to outright data loss.

What is discouraging is that these problems are old and were never solved. The failing of timed events and now this tell me nothing has changed. The old problems are still under the surface. My speculation is that this is all being caused by an overloaded platform. I pray that it doesn’t degrade to the point it did the last time.

In case you didn’t see this:

For reference -

Platform Update = Update to ST’s backend servers (the ST “Cloud”).
Hub/Firmware Update = Update to your physical ST hub in your home.
App Update = Update to the iOS/Android app. Maybe the Windows Mobile app too but that’s wishful thinking :smile:

I have lost nearly half of my CoRE apps. This is REALLY REALLY BAD.

Is the problem supposed to be fixed? Are they still trying to restore the lost data?

I’ve got a real mess on my hands at the moment.