Root Cause of most of my ST issues

Fuzzyligic · January 7, 2016, 9:59am

I have had probably the worst few weeks as an introduction to the ST platform. I’m on a V2 hub with just under 100 devices connected. However, I have got to the bottom of what causes most of my issues with ST. I am a UK user, for your info.

I am using Rule Machine heavily; god knows how you guys coped before this smartapp was released. I generally get good reliability from rules that are driven by events. Sometimes the actions are delayed somewhat, like nipping into a room to get something and the light only coming on once I have got the thing I was after and am on the way back out the door, but I can live with that.

Timed schedules, such as waking up by a light at 6am, generally work for me as long as I avoid round times: if I use 6am I get maybe 60% reliability, if I use 05:58 I get 90% reliability. I get better reliability if I stay away from the 00, 15, 30 and 45 minutes past the hour slots. Likewise, if I use sunset or sunrise, I keep the offsets away from 15, 30 and 45 and use an arbitrary minute near the value I want; then I get something like 90% reliability on these rules also.
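A minimal sketch of the workaround described above, assuming the busy slots are exactly the 00, 15, 30 and 45 minute marks. The function name and the two-minute shift are illustrative choices, not anything provided by ST or Rule Machine:

```python
from datetime import datetime, timedelta

# Assumption: these are the congested minute slots described above.
BUSY_MINUTES = {0, 15, 30, 45}


def nudge_off_busy_minute(when: datetime, shift_minutes: int = 2) -> datetime:
    """Move `when` earlier by `shift_minutes` if it falls on a busy minute."""
    if when.minute in BUSY_MINUTES:
        return when - timedelta(minutes=shift_minutes)
    return when


wake = datetime(2016, 1, 7, 6, 0)
print(nudge_off_busy_minute(wake).strftime("%H:%M"))  # prints 05:58
```

The shift should not itself be a multiple of 15 minutes, otherwise the nudged time lands on another busy slot.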

I strongly believe this is due to a limit on the number of threads that the ST cloud can process at once, and I think they have a cleanup of long-running threads to stop the platform falling over with masses of open threads, hence missed executions when the execution is queued.

What really bothers me are missed actions based on mode and SHM evaluation. These are 100% attributable to modes, and of course SHM modes, sticking. The problem here is what I can see using the IDE: occasionally when changing modes (1 out of 5 times) I can see in the app that my hub shows the mode has been changed, but through the IDE I can see it has not changed in the cloud. So this mode switch message is being missed, maybe due to the same issue as with scheduling rules, i.e. the thread is being killed.

But here is the rub. Say I have switched from Night mode to Home on my hub, and my hub shows Home but the cloud still shows Night. It is the same for SHM, with the hub showing Disarmed and the cloud still showing Armed (Away). It seems Samsung then have a reconciliation action running, from what I can see approximately every 10 minutes, give or take a few minutes (maybe done on the hub itself?), to reconcile the state. But they take the state in the cloud as the master and not the state on my hub (which IMHO would make more sense), so the state on my hub then switches back to Night, for example, or back to Armed (Away) for SHM.

During this discrepancy I get no correct executions that rely on mode evaluation in the cloud, and then after the reconciliation I lose my local executions based on mode evaluation, as the mode has been switched back, until I manually switch the mode to the correct one.
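To make the difference concrete, here is a rough sketch contrasting the "cloud is master" behaviour observed above with a "most recent change wins" alternative. It is purely illustrative: the `ModeState` type and both functions are hypothetical, and nothing here is known about how ST actually reconciles state.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ModeState:
    mode: str
    changed_at: datetime  # when this side last recorded a mode change


def reconcile_cloud_wins(hub: ModeState, cloud: ModeState) -> ModeState:
    """The behaviour described above: the cloud copy always overrides the hub."""
    return cloud


def reconcile_latest_wins(hub: ModeState, cloud: ModeState) -> ModeState:
    """Alternative: whichever side changed most recently is treated as correct."""
    return hub if hub.changed_at > cloud.changed_at else cloud


# Hub switched to Home at 06:00; the cloud missed it and still holds last night's value.
hub = ModeState("Home", datetime(2016, 1, 7, 6, 0))
cloud = ModeState("Night", datetime(2016, 1, 6, 23, 10))

print(reconcile_cloud_wins(hub, cloud).mode)   # prints Night
print(reconcile_latest_wins(hub, cloud).mode)  # prints Home
```

The second strategy depends on the hub and cloud clocks being reasonably in sync, which is an assumption of this sketch.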

Again, this mode sticking/discrepancy does not happen every time, but it is very frequent.

Interestingly, when the mode is switched by the reconciliation, not one of the logs available in the IDE shows the mode switching back to the state in the cloud. This makes it impossible to debug.

I am an infrastructure monitoring consultant who has worked for an organisation with over 14000 servers globally that I was solely responsible for monitoring, and the same challenges I have met in the past regarding event storms and thread executions also apply to the SmartThings platform. HA and infrastructure monitoring are identical in that respect: collect values and do things based on the evaluation of those values. Obviously those 14000 servers will generate far more events per hour than the average SmartThings hub will, but I imagine the number of hubs worldwide far exceeds 14000, so the amount of traffic would probably match.

I suspect (I am speculating, please ST correct me if I am wrong) that the ST platform has historically hit the same outages that I experienced with my monitoring platform during event storms with queued executions, and that remediation has taken place to prevent the issue occurring again. This has effectively been a sticking plaster on the platform to prevent outages due to resource exhaustion, but I suspect it has introduced the effective ghost in the machine. The fixes have not been revisited, and in fact have been replicated onto the EU server farm.

And I know from personal experience that this kind of thing makes it extremely hard to troubleshoot and get to the root cause, as there is little correlation between these types of failures. You see a mass of generic issues all over the place, and if you try to do a Pareto chart of all your issues there is no major candidate to work against, just a sea of individual reports of intermittent failures, i.e. the effective ghost in the machine.

I successfully managed to troubleshoot the same sort of issue in my last role, which in my mind is very, very similar to what I can see to be the general issue here. Admittedly I can only see one side of the conversation; I have no idea of the size of the infra you have to manage the ST platform, but I imagine it isn’t that large by the standards of some of the platforms I have worked on in the past.

I know, though, that if I take this through support I will get nowhere, as there are no logs I can provide, only dates and approximate times. By the time the ticket got to the person who could deal with it most effectively, there would be the usual issue with big data sources: the data would have been groomed before an investigation was underway. I can see that raw data such as logs is groomed from the IDE after 7 days. I have no idea what the data warehousing solution is, if any, but I can bet my bottom dollar the raw data is not warehoused for much time at all before it is groomed. Which leaves what? Aggregations of data? Maybe, maybe not.

But for the record, my issue has now occurred on two consecutive mornings at 06:00 GMT, where the mode has switched to Home but is still Night in the cloud, and at about 06:10 it is switched back to Night mode. There is no smartapp or rule that switches the mode to Night other than a single rule in Rule Machine, and its execution time is limited to 21:30 - 02:00. Looking at the event logs in the IDE I can see that rule switching to Night mode; last night, for example, this occurred just before 23:10. However, the switch back to Night mode this morning produced no logs. Before it switched back this morning I looked at the IDE, and it showed the hub was in Night mode from 06:00 - 06:10, while the mode locally was showing as Home. Once Night mode was triggered again at 06:10 I manually changed the mode to Home, and the cloud reflected this change within a few seconds.

Please feel free to look for the root cause of my issue. I would love to work on it with the right someone, but the UK support IMHO is pretty poor.

Regards

Stuart Buchanan

geko · January 7, 2016, 10:34am

You nailed it. This system is haunted. The sightings have been quite frequent lately. Some have been calling it poltergeist.

Visited by poltergiests on Halloween

bravenel · January 7, 2016, 3:42pm

Obviously, as you have already deduced, your problem is most likely the mishandled event spike at 06:00. You should move your mode change to an off time just as you did your other automations. Also, although this is obviously a cheap fix, add a second instance of the mode change logic several minutes later. If the mode change happened, nothing happens. If it didn’t happen, another chance for it to go.
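The "second instance" idea works because the mode change is idempotent: repeating it is harmless if the first attempt succeeded. A small sketch of that logic follows, using a hypothetical `FakeHub` stand-in that silently drops its first mode change; neither name is part of any ST or Rule Machine API.

```python
class FakeHub:
    """Stand-in for a hub whose first mode change is silently dropped."""

    def __init__(self, mode: str, drops: int = 1):
        self.mode = mode
        self._drops_left = drops

    def set_mode(self, mode: str) -> None:
        if self._drops_left > 0:
            self._drops_left -= 1  # simulate a missed execution
            return
        self.mode = mode


def ensure_mode(hub: FakeHub, target: str) -> bool:
    """Issue a mode change only if needed. Returns True if a change was attempted."""
    if hub.mode == target:
        return False
    hub.set_mode(target)
    return True


demo_hub = FakeHub("Night")
ensure_mode(demo_hub, "Home")  # first scheduled attempt, dropped in this simulation
ensure_mode(demo_hub, "Home")  # second attempt a few minutes later
print(demo_hub.mode)           # prints Home
```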

I was having lots of trouble with my night to day mode change at an obvious time like sunrise. Moving it away did the trick, and it’s been reliable since.

SmartThings is well aware of what you have deduced, and many of them are clearly embarrassed by the mess. None of us know what they are doing to remedy the situation, and they aren’t shouting from the rooftops about it. Contacting support is pointless, IMO, the response would be “we are aware of this problem and are working on a fix, no eta”. The usual…

Kristopher · January 7, 2016, 4:44pm

Same here. Sunrise/sunset has been brutal all the way back to the kickstarter days. The desync of modes between the cloud and V2 is incredibly annoying and hard to manage. I think it is the most outstanding issue in the setup.

As Bruce mentioned, it’s cheap to have some safety nets around mode handling. I try to craft rules so that they’ll fire if a condition is hit on a broad window. For example, instead of my morning mode firing at 8 am, it fires if there’s motion anywhere from 8 am to 10 am and my previous mode was Sleep or Away.
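The broad-window rule above could be expressed roughly like this. It is a sketch only; the function name and signature are made up for illustration, and the mode names are taken from the example:

```python
from datetime import time


def should_switch_to_morning(now: time, motion: bool, previous_mode: str) -> bool:
    """Fire on any motion between 08:00 and 10:00 if the previous mode was Sleep or Away."""
    in_window = time(8, 0) <= now <= time(10, 0)
    return in_window and motion and previous_mode in {"Sleep", "Away"}


print(should_switch_to_morning(time(8, 37), True, "Sleep"))  # prints True
print(should_switch_to_morning(time(7, 59), True, "Sleep"))  # prints False
```

Because the condition is evaluated on every motion event inside the window, one missed event does not matter; the next one gets another chance.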

And indeed, Rule Machine is probably the best app to date

Fuzzyligic · January 7, 2016, 5:56pm

This obviously may be, but even then I don’t get 100% reliability, only about 90%, probably because there is now an event spike a minute or two before the hour as people like yourself realise this is the case and schedule accordingly. But TBH I want to wake up at 6am, not 5:54, 5:56 or 5:58.

This I could also do, but if everyone did the same it would just compound the issue. It’s frustrating, and as you said clearly embarrassing for ST, but I have been looking at the history through these forums and it is grim reading, and it leaves me with very little hope this will ever get resolved.

They really should be looking at the various monitoring platforms for solutions to the same issues, and I am not talking about the likes of Nagios; there are more resilient platforms. Most work on request caching on the agents, with caches on the server that then feed into request execution and active processing queues.

This would enable 100% of executions to take place. I don’t mind a wait, but the thing is, if requests are submitted via TCP (I’ve not port mirrored and sniffed the device to figure this out) there is simply no excuse for a missed execution, as the receipt of any request has been acknowledged. So if it’s not executing then it’s being killed off, as stated.
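A toy sketch of the agent-side caching idea: the agent keeps every request until the server confirms it was executed, so an overloaded server delays work rather than losing it. `AgentCache` and `make_server` are hypothetical names for illustration and do not describe any particular monitoring product or the ST platform.

```python
from collections import deque


class AgentCache:
    """Agent-side store that keeps each request until the server confirms execution."""

    def __init__(self):
        self._pending = deque()

    def submit(self, request: str) -> None:
        self._pending.append(request)

    def flush(self, execute) -> list:
        """Offer each pending request once; keep whatever was not executed."""
        executed, still_pending = [], deque()
        while self._pending:
            request = self._pending.popleft()
            if execute(request):
                executed.append(request)
            else:
                still_pending.append(request)
        self._pending = still_pending
        return executed

    def pending(self) -> list:
        return list(self._pending)


def make_server(capacity: int):
    """Simulated server that refuses work once `capacity` requests have been handled."""
    remaining = [capacity]

    def execute(request: str) -> bool:
        if remaining[0] == 0:
            return False
        remaining[0] -= 1
        return True

    return execute


cache = AgentCache()
for name in ("lights_on", "mode_home", "heating_on"):
    cache.submit(name)

print(cache.flush(make_server(2)))  # prints ['lights_on', 'mode_home']
print(cache.flush(make_server(2)))  # prints ['heating_on']
```

The third request is late, but it is not dropped, which is the "I don’t mind a wait" trade-off described above.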

I’m excluding Z-Wave and Zigbee issues where the request has been sent to the device, as these are usually timing issues or the result of screwy logic in the rules. But this sort of stuff you can see in the logs.

What annoys me even more is that I was starting to use openHAB on a Raspberry Pi 2 before I got SmartThings: 100% local executions and IFTTT integration to boot. But I hadn’t got onto Zigbee and Z-Wave integration when SmartThings came to my attention (well, at least that it was released in the UK) through the heavy marketing push. I thought, why not?

I have serious thoughts about going back down the openHAB route, although it’s nowhere near as pretty or user friendly as ST. Given the amount of coding I am doing so far to get what I want up and running in ST, openHAB will be a lot more effort, but if in the end it’s more reliable then I should be considering it, as my wife is moaning about the unreliability and saying I should not have bothered.

But it comes down to this: at the moment I have time and money invested in ST and am trying to get a return on it. SmartThings can be trusted to do the switching you can see, but locks and security? Erm, nope, I would not trust that at all at the moment, and I do have them just waiting to be used…

I would love to get my hands on tracing this issue, as I have experience of tracing this kind of issue. But yes, support is useless, with nothing but canned replies, and unless you can work with them to trace this with specific cases I fear this debacle will never be resolved. Some of you guys have been waiting a very, very long time for this to be fixed… with no ETA.

Or are my expectations too high for this product?

Fuzzyligic · January 7, 2016, 6:00pm

Oh, and one more point: why is there no dedicated troubleshooting/problems section in this community? Is that because that section would receive so many posts that they would have to kill the posting requests to stop the community falling over?

SBDOBRESCU · January 7, 2016, 6:07pm

Although I am among the lucky few (or so it seems) to rarely have sunset/sunrise issues - I can count a handful of times in over a year when things didn’t run as expected - but I always wondered why people with major issues don’t use IFTTT to turn a (virtual) switch on/off at sunset/sunrise - and use that switch as trigger for other “things”? I’ve used IFTTT to turn my Christmas tree at sunset and didn’t misfire once in several weeks - and you don’t even need to work around +/- overload minutes.

bravenel · January 7, 2016, 6:16pm

Personally, I think there will always be issues, but that improvement has happened and will continue. I don’t think these particular issues are baked in, I think they will be fixed.

I’m pretty much the eternal optimist sort.

Fuzzyligic · January 7, 2016, 6:32pm

The problem with IFTTT is that it is far too limited in its scope. I really, really need IFTTTTTOT for most of my rules in ST, which is why Rules Engine is excellent.

Please don’t take this the wrong way, but that really is where the split is between ST users: those of you who are using it to schedule a small number of things based on a small number of sensors seem to be OK.

IFTTT has its place because of the number of channels it has, but in my opinion it should be used the other way around: do the evaluations in ST and then use IFTTT to trigger a device with no integration. This is how I am using it for my Tado Heating & HVAC thermostats, as I have a read-only device type now for each.

If all you are using it for is switching on Christmas lights and simple evaluations such as that, then a far cheaper and simpler way IMHO would be to use a cheap and 100% effective timer at a cost of £4. Just my 2c worth.

SBDOBRESCU · January 7, 2016, 6:46pm

I was just talking about eliminating sun state issues, which Rule Machine is dependent on. So using Rule Machine doesn’t solve the platform issue!

LOL…this is funny. How do you know how many things I have? Just curious…

While that is true, it’s even much cheaper to use a Wemo Insight that was just lying around the house. Sure, I could have used the Wemo integration in ST, but… well, you might know already how well Wemo works with ST.

Fuzzyligic · January 7, 2016, 6:56pm

I don’t, but I do see a lot of posts where people who state they have no issues only have a small number of devices and simple automations. So pray tell, how many do you have? Just curious.

bravenel · January 7, 2016, 6:59pm

What is a “small number of sensors”? I have about 200 devices, of which 25% are sensors. Zero IFTTT. Count me in the group who has few problems. I have about 50 Rules and probably 50 other SmartApps doing automations.

The vast majority of my stuff works the vast majority of the time.

djekels · January 7, 2016, 7:17pm

Where can I find this magic “Rule Machine”?

bravenel · January 7, 2016, 7:18pm

Follow this link:

SBDOBRESCU · January 7, 2016, 7:19pm

Ha, I am well beyond your 98 inventory, but unlike you, most of my devices connected to ST are eligible to run locally. Your 88 devices may be the root cause of the issues you are experiencing, especially if they are based on external APIs. If they are just custom device types, then you may have a good chance to see the stabilization that Bruce is talking about.

bravenel · January 7, 2016, 7:21pm

Bobby, I’m on hub V1 with 200 devices. Very few problems, 100% cloud based…

SBDOBRESCU · January 7, 2016, 7:23pm

I know, but if @Fuzzyligic has most of his devices based on external APIs, it can cause a lot of trouble. My remark wasn’t about local processing but rather about the type of devices.

bravenel · January 7, 2016, 7:55pm

Understood. The weak under-belly has always been external APIs. I avoid them after my experience with Wemo and TCP. Echo and Harmony work very well for me.

SBDOBRESCU · January 7, 2016, 7:59pm

Yup, and you can add Hue bridge to that list. Since I moved all my GE link bulbs to Hue, I have had 0 problems. Hue doesn’t “forget” to dim the bulbs nor loses them, and ST has been kind not to time them out so I can use them in Rule Machine

Fuzzyligic · January 7, 2016, 8:00pm

The majority of my automations are also now working the vast majority of the time, but as stated this is not a solid 100%, and what I am trying to drive home here is how important reliability in HA is when you start to use it for door locks and the physical security of your home.

But this is why I have referred to this issue as the ghost in the machine: there are instances where this is thrown out of the window, and I was trying to share my observations on the instances where it was failing and my reasoning why.

The problem is that my house is wired so that in most cases there are no manual overrides except on wall-mounted tablets, as I went about this in one big hit before I realised the issues that were prevalent on the platform. Maybe some people are happy driving away from their home hoping the lock will actually lock when they leave, for example. With the current issues that do crop up occasionally, though, I would not trust it unless I could see the door and manually test that it is locked, as the status cannot be 100% guaranteed to be correct at the time you look to check on the incredibly buggy app. (Yes, I use smarttiles.io, which does improve the situation massively, but the status updates can sometimes be a bit slow also.)

Pretty sure my insurance would not want to pay out if I was burgled without forced entry. Not sure, though!

I could rant and rant; it’s not all bad, though. But I did experience a big issue just before Christmas: a bug in the backend database where the devices I had could not be parsed. The only way around this was to delete nearly all the devices/smartapps/rules until it was fixed. This took a hell of a lot of time and undid lots and lots of work. I was faced with the prospect of a Christmas in the dark, as the pitiful response time from ST support would not have helped me, and in the end I had to work for nearly 2 days straight to fix it. This highlighted what I would give for a Windows-based admin app to re-input the rules etc., and even more so a way to back all this stuff up. It seems ST’s solution in the same situation was to forcefully remove people’s devices, also leaving them with the same issue of having to redo all the work, but with no explanation of the root cause, and an outage of over 3 and 1/2 weeks before support fixed it.

Now the interesting thing is that the device that caused that issue on the backend database was an officially supported device, an Aeon Relay.

There is an HA utopia to be had somewhere, but I don’t personally think that will be with ST any time soon, especially as they have announced a V3 hub integrated with the Samsung TVs. They seriously need to get their house in order before having more kids.

But I am British, and us Brits are known for having a good moan. The same is not known on the other side of the pond, AFAIK.
