If they’re trying to make a fully resilient platform, wouldn’t they be better off using both AWS and Azure, for instance, so that if one goes down the other can take over?
I know this has been done by some enterprises and has been proven to work.
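To make the idea concrete: at its simplest, multi-cloud failover is a client (or DNS layer) that health-checks a primary endpoint and falls back to a secondary endpoint in another provider. A minimal sketch in Python — the endpoint URLs are made up for illustration, and real setups usually do this at the DNS/load-balancer level rather than in the client:

```python
import urllib.request
import urllib.error

# Hypothetical endpoints, one per cloud provider (names are made up).
ENDPOINTS = [
    "https://api-aws.example.com/health",
    "https://api-azure.example.com/health",
]

def http_ok(url, timeout=2):
    """True if the endpoint answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def first_healthy(endpoints, is_healthy=http_ok):
    """Return the first endpoint whose check passes, else None (total outage)."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None
```

The hard part in practice isn’t this routing logic — it’s keeping state (databases, device registrations) in sync across providers so the secondary can actually serve traffic.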
I’m very disappointed in all of you for leading this person on. Using your tech jargon to make things sound fancy so that it’s actually believable. @kcm117 please do not fall for this trick they are trying to pull on you. Everyone knows that the cloud SmartThings uses is Cirrocumulus. This is the high-level cloud where all data is originally stored. If Cirrocumulus starts to bog down with information, it is quickly spread out over Cirrus and Cirrostratus. When information needs to be sent back to your hub, it travels through one of Altostratus, Altocumulus or Nimbostratus. Altocumulus is the most common mid-level cloud; more than one layer of Altocumulus often appears at different levels at the same time, and many times Altocumulus will appear with other cloud types. After the data is processed through one of these clouds, it then travels through one of the clouds closer to your hub, which are called Cumulus, Stratus, Cumulonimbus or Stratocumulus. If you are ever having issues with your data, it’s probably because it went through Cumulonimbus. That cloud is the one that causes the most disruption amongst people. It can get very scary when stuck in that cloud. Everything starts spinning, giving you the feeling your data is caught up in a tornado.
Cassandra was a big driver of issues earlier this year, though this has been largely mitigated. Hopefully the community has noticed the platform has been more reliable in general since March — the result of an organization-wide focus on solving those problems. You are correct that AWS stability has rarely been a cause of downtime (no recent incidents come to mind).
In recent memory, the (high-level) causes have been:
Caching failures that put stress on our relational database (this was the really bad downtime issue that you quoted me on)
Network connectivity failures in the services that Hubs connect to (Hubs going offline) — a lot of effort is being put into finding root causes. (It’s probably not fair to count this as a single issue, but I’m not sure how many details I’m allowed to share here.)
Deployments. There have been some pretty drastic architectural changes to facilitate performance improvements and bug fixes. Occasionally unforeseen bugs creep into Production, but these have largely been very localized (think IDE-only issues, or problems affecting a small subset of users).
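On the first cause above: in a cache-aside setup, a failed or cold cache means every read falls through to the relational database at once, which is what produces the stress. A minimal sketch of the mechanism — all names here are hypothetical, not actual SmartThings code:

```python
db_queries = 0  # counts how often the database is actually hit

def query_database(key):
    """Stand-in for an expensive relational query."""
    global db_queries
    db_queries += 1
    return f"row-for-{key}"

def get(key, cache):
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    if key in cache:
        return cache[key]
    value = query_database(key)  # a cold or failed cache sends ALL reads here
    cache[key] = value
    return value

cache = {}
for _ in range(1000):
    get("device-42", cache)
# With a healthy cache, 1000 reads cost one database query; if the cache
# fails (every lookup misses), all 1000 reads hammer the database instead.
```

Mitigations for this pattern typically include request coalescing, circuit breakers, or serving stale data while the cache rebuilds — which is the kind of hardening that "largely mitigated" likely refers to.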
There are of course outstanding issues out there that impact users on a day-to-day basis. Engineering communicates with support frequently to get reports on which issues their team is fielding the most, which helps drive our prioritization. This is probably why Tim and Jody are constantly reminding everyone to contact support when they run into problems.