I have had probably the worst few weeks as an introduction to the ST platform. I'm on a V2 hub with just under 100 devices connected. However, I have got to the bottom of what causes most of my issues with ST. I am a UK user, for your info.
I am using Rule Machine heavily; god knows how you guys coped before this smartapp was released. I generally get good reliability from rules that are driven by events. Sometimes the actions are delayed somewhat, like nipping into a room to get something and the light only coming on once I have got the thing I was after and am on the way back out the door, but I can live with that.
Timed schedules, such as waking up to a light at 6am, generally work for me as long as I avoid round times: if I use 06:00 I get maybe 60% reliability, if I use 05:58 I get 90% reliability. I get better reliability if I stay away from the 00, 15, 30 and 45 minutes past the hour slots. Likewise, if I use sunset or sunrise, I keep the offsets away from 15, 30 and 45 and use an arbitrary minute near the value I want; then I get something like 90% reliability on these rules as well.
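As a rough illustration of that workaround, here is a minimal Python sketch. It is not ST or Rule Machine code; the function name, the set of "busy" minutes and the two-minute shift are assumptions taken from my own observations above, not documented platform limits.

```python
from datetime import datetime, timedelta

# Minutes past the hour that looked unreliable in my own testing.
# This is an observation, not a documented ST limit.
BUSY_MINUTES = {0, 15, 30, 45}


def nudge_off_busy_slot(wanted: datetime, shift_minutes: int = 2) -> datetime:
    """Move `wanted` earlier by `shift_minutes` if it lands on a busy slot.

    Times that are already off the quarter-hour marks are returned unchanged.
    """
    if wanted.minute in BUSY_MINUTES:
        return wanted - timedelta(minutes=shift_minutes)
    return wanted


# A 06:00 wake-up light becomes 05:58; 06:07 is left alone.
wake_up = nudge_off_busy_slot(datetime(2016, 1, 1, 6, 0))
untouched = nudge_off_busy_slot(datetime(2016, 1, 1, 6, 7))
```

The same idea applies to sunrise/sunset offsets: pick 13 or 17 minutes rather than 15.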
I strongly believe this is due to a limit on the number of threads that the ST cloud can process at once, and I think they have a cleanup of long-running threads to stop the platform falling over with masses of open threads, hence the missed executions when an execution is queued.
What really bothers me are missed actions based on mode and SHM evaluation. These are 100% attributable to modes, and of course SHM modes, sticking. The problem is what I can see using the IDE: occasionally when changing modes (about 1 time out of 5), the app shows that my hub's mode has changed, but through the IDE I can see that it has not changed in the cloud. So the mode switch message is being missed, maybe due to the same issue as with the scheduled rules, i.e. the thread is being killed.
But here is the rub. Say I have switched from Night mode to Home on my hub, and my hub shows Home but the cloud still shows Night. It is the same for SHM, with the hub showing Disarmed and the cloud still showing Armed (Away). It seems Samsung then have a reconciliation action running, from what I can see approximately every 10 minutes, give or take a few minutes (maybe done on the hub itself?), to reconcile the state. But they take the state in the cloud as the master and not the state on my hub (which IMHO would make more sense), so the state on my hub then switches back to Night, for example, or back to Armed (Away) for SHM.
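To make the sequence concrete, here is a small Python model of what I think is happening. This is purely my mental model based on what I can observe from outside: the class, the function names and the `master` parameter are made up for the illustration and say nothing about how ST is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class ModeState:
    hub: str    # mode as shown by the hub / mobile app
    cloud: str  # mode as shown in the IDE


def change_mode(state: ModeState, new_mode: str, cloud_received: bool) -> None:
    """Change the mode on the hub; the cloud only follows if the message arrives."""
    state.hub = new_mode
    if cloud_received:
        state.cloud = new_mode


def reconcile(state: ModeState, master: str = "cloud") -> None:
    """Force both sides to agree, taking `master` ("cloud" or "hub") as the truth."""
    if master == "cloud":
        state.hub = state.cloud
    else:
        state.cloud = state.hub


# What I observe: the mode change message is lost, then the cloud wins.
observed = ModeState(hub="Night", cloud="Night")
change_mode(observed, "Home", cloud_received=False)  # hub=Home, cloud=Night
reconcile(observed, master="cloud")                  # hub is pushed back to Night

# What I would expect instead: the hub wins.
expected = ModeState(hub="Night", cloud="Night")
change_mode(expected, "Home", cloud_received=False)
reconcile(expected, master="hub")                    # cloud catches up to Home
```

With the cloud as master, a single lost message undoes the mode change everywhere; with the hub as master, the same lost message would heal itself at the next reconciliation.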
During this discrepancy I get no correct executions for anything that relies on mode evaluation in the cloud, and then after the reconciliation I also lose my local execution actions based on mode evaluation, as the mode is switched back, until I manually switch the mode to the correct one.
Again, this mode sticking/discrepancy does not happen every time, but it is very frequent.
Interestingly, when the mode is switched by the reconciliation, not one of the logs available in the IDE shows the mode switching back to the state in the cloud. This makes it impossible to debug.
I am an infrastructure monitoring consultant who has worked for an organisation with over 14000 servers globally that I was solely responsible for monitoring, and the same challenges I have met in the past regarding event storms and thread executions also apply to the SmartThings platform. HA and infrastructure monitoring are identical in that respect: collect values and do things based on the evaluation of those values. Obviously those 14000 servers will generate far more events per hour than the average SmartThings hub will, but I imagine that the number of hubs worldwide far exceeds 14000, so the amount of traffic would probably match.
I suspect (I am speculating, please ST correct me if I am wrong) that the ST platform has historically hit the same outages that I experienced with my monitoring platform during event storms with queued executions, and that remediation has taken place to prevent the issue occurring again. This has effectively been a sticking plaster on the platform to prevent outages due to resource exhaustion, but I suspect it has introduced what is effectively a ghost in the machine. The fixes have not been revisited, and have in fact been replicated onto the EU server farm.
And I know from personal experience that this kind of thing makes it extremely hard to troubleshoot and get to the root cause, as there is little correlation between these types of failures. You see a mass of generic issues all over the place, and if you try to do a Pareto chart of all your issues there is no major candidate to work on, just a sea of individual reports of intermittent failures, i.e. the ghost in the machine.
I successfully managed to troubleshoot the same sort of issue in my last role, which in my mind is very similar to what I can see to be the general issue here. Admittedly I can only see one side of the conversation; I have no idea of the size of the infrastructure you have to manage the ST platform, but I imagine it isn't that large by the standards of some of the platforms I have worked on in the past.
I know, though, that if I take this through support I will get nowhere, as there are no logs I can provide, only dates and approximate times. By the time the ticket got to the person who could deal with it most effectively, there would be the usual problem with big data sources: the data would have been groomed before an investigation was under way. I can see that raw data such as logs is groomed from the IDE after 7 days. I have no idea what the data warehousing solution is, if any, but I can bet my bottom dollar the raw data is not warehoused for much time at all before it is groomed. Which leaves what? Aggregations of data? Maybe, maybe not.
But for the record, my issue has now occurred on two consecutive mornings at 06:00 GMT, where the mode has switched to Home on the hub but is still Night in the cloud, and at about 06:10 the hub is switched back to Night mode. There is no smartapp or rule that switches the mode to Night other than a single rule in Rule Machine, and its execution time is limited to 21:30 - 02:00. Looking at the event logs in the IDE I can see that rule switching to Night mode; last night, for example, this occurred just before 23:10. However, the switch back to Night mode this morning produced no logs at all. Before it switched back this morning I looked at the IDE, and it showed the hub in Night mode from 06:00 - 06:10, while the mode locally was showing as Home. Once Night mode was triggered again at 06:10, I manually changed the mode to Home and the cloud reflected this change within a few seconds.
Please feel free to look for the root cause of my issue. I would love to work on it with the right someone, but the UK support is IMHO pretty poor.
Regards
Stuart Buchanan