I started data mining my ST and here's what I found


(Jesse S) #1

Hi all! About 9 months ago I decided to start an experiment with my ST setup to see if I could do some data mining and learn anything interesting about our household habits. After experimenting with a few different platforms for data logging, I decided the best bet was to simply use IFTTT to store events into my Google Drive. From there, I was able to pull out the spreadsheet entries and import them into MATLAB, a scientific data analysis program. Once in MATLAB, it was literally one line of code to convert the spreadsheet data into a data matrix with all entries in an easy-to-analyze format.
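For anyone who wants to try the conversion step without MATLAB, here's a rough Python sketch of the same idea: take the timestamp strings from the spreadsheet and turn each one into a row of numbers. This is not the MATLAB function I used. The timestamp layout (`IFTTT_FORMAT`) and the function name `to_date_rows` are assumptions for illustration, so check the format against your own spreadsheet.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "May 05, 2016 at 09:15AM".
# Adjust this to match whatever your spreadsheet actually contains.
IFTTT_FORMAT = "%B %d, %Y at %I:%M%p"


def to_date_rows(stamps, fmt=IFTTT_FORMAT):
    """Turn timestamp strings into rows of [year, month, day, hour, minute, second]."""
    rows = []
    for text in stamps:
        t = datetime.strptime(text.strip(), fmt)
        rows.append([t.year, t.month, t.day, t.hour, t.minute, t.second])
    return rows


if __name__ == "__main__":
    # Made-up sample entries, not real log data.
    sample = ["May 05, 2016 at 09:15AM", "May 05, 2016 at 06:42PM"]
    for row in to_date_rows(sample):
        print(row)
```

Each row comes out on a 24 hour scale, so the hour column can go straight into a histogram.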

For my first test case, I decided to look at the open/close events coming from our front door. I figured it might give some insight into how regular my schedule is going to/from work, or maybe reveal something else interesting about daily activity around here. Let me throw up the plots, then I’ll explain below what they are and what I’ve learned (spoiler alert: don’t expect to be surprised).

The first plot shows all 5,266 events logged over the course of 9 months (corresponding to about 20 open/close events per day). I set the y-axis to be the day of the month that the event occurred, so it visually breaks up the individual months in a clean fashion. The first bit all the way at the beginning of the plot is back in August, and you’ll notice in September (the next line) there’s a gap of a couple of weeks. At that point, I was still fiddling with my data logging options and had the IFTTT channel deactivated. Once I came back to IFTTT as my main tool for this project, the data goes unbroken all the way up until early May (the last little line all the way on the right).

If you look closely at each of the months (each diagonal line is a month), you’ll notice that they’re a little jagged. That’s because on each day the door opened and closed some pseudo-random number of times. The days we opened the door more frequently, there’s a longer horizontal dash. The days we opened the door less, it’s a bit shorter. One thing you can see is that in November and December the trend becomes less steep toward the end of the month, meaning the door was opened/closed more times on a given day. Since this lines up with the holiday season, I suspect these trends come from the fact that we had company over, were home more instead of at work, and as a consequence, were more active in our comings-and-goings. In Jan, Feb, and March, the data is much more linear with an essentially constant slope and is generally steeper than the holiday season events. Since it gets very cold here in Boston, I take this as empirical confirmation that we went out the bare minimum during those months!

The next two plots are histograms of the time that the door was opened and closed. The lower left plot is hours and the lower right is minutes. The histograms have been normalized and multiplied by 100 so we can read them as probabilities in percentage points. So for example, looking over the entire 9 month period, there’s about a 7% chance the door was opened at 9:00, a 7.5% chance it was opened at 10:00, a 5.5% chance it was opened at 11:00, etc. Here, I’m using a 24 hour scale to avoid any AM/PM nonsense. Two things jump out at me from this data. First, we are definitely not coming or going between 1 and 6 am. Second, the dog doesn’t get walked as regularly as I thought! Looking at the data, the uptick around 8 and 9 is typically when I leave for work. The peak at 10 is likely my wife walking the dog and going about her business. The second burst of activity between 18:00 and 21:00 (6:00 pm and 9:00 pm) is when we’re both home from work, going for walks with the dog, taking out the garbage, etc. My expectation was that these morning and evening peaks would be more pronounced, with the afternoons having less activity, but I guess not!
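In case the normalization isn't clear, here's a minimal Python sketch of how an hourly histogram like this can be built: count the events in each hour, divide by the total, and multiply by 100. The function name `hourly_percent` and the sample hours are made up for illustration; this is not the MATLAB code behind the plots.

```python
from collections import Counter


def hourly_percent(hours):
    """Return a 24-entry list: percent of all events that fall in each hour (0-23)."""
    counts = Counter(hours)
    total = len(hours)
    if total == 0:
        return [0.0] * 24
    return [100.0 * counts.get(h, 0) / total for h in range(24)]


if __name__ == "__main__":
    # Illustrative hours only, not the real front door log.
    made_up_hours = [8, 9, 9, 10, 10, 10, 18, 19, 19, 20]
    for hour, pct in enumerate(hourly_percent(made_up_hours)):
        if pct > 0:
            print(f"{hour:02d}:00  {pct:.1f}%")
```

Because every event lands in exactly one hour, the 24 values always add up to 100.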

For the third plot in the lower right, I decided to do a “control” and, instead of looking at what hour the door was opened, look at what minute it was opened. For this, opening the door at 6:06 and 10:06 are equivalent since the value of the minutes is identical. My first thought was that I might see peaks in the 45 to 50s, thinking that maybe we leave 10 to 15 minutes early when we have to be somewhere at X o’clock sharp. Then I thought a bit harder and realized that our life just isn’t that regular and instead suspected the plot should be more flat. Looking at the data, we see that indeed, it’s basically flat. Sure there are blips and bumps, but in general, everything hovers between 1.5 and 2% (note that 100% / 60 = 1.67%, which is exactly in the right range!). There’s a chance I could convince myself that activity steadily decreases for the first 35 minutes of the hour then increases as we get closer to the end of the hour, but the data’s pretty noisy and I’m a bit skeptical. So, what’s the takeaway on this? Our comings-and-goings are almost uniformly distributed throughout the hour!
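To put a rough number on "pretty noisy", here's a small Python sketch of how much each minute bin would be expected to wobble if the events really were uniform across the hour. It treats each bin count as binomial, which is a simplifying assumption of mine (real door events come in open/close pairs, so they aren't fully independent). The function name `uniform_bin_stats` is made up for this example.

```python
import math


def uniform_bin_stats(n_events, n_bins=60):
    """Expected percent per bin and its 1-sigma spread if events were uniform."""
    p = 1.0 / n_bins
    expected_pct = 100.0 * p
    # Binomial standard deviation of one bin's count, as a percent of all events.
    sigma_pct = 100.0 * math.sqrt(n_events * p * (1.0 - p)) / n_events
    return expected_pct, sigma_pct


if __name__ == "__main__":
    expected, sigma = uniform_bin_stats(5266)  # 5,266 events from the first plot
    print(f"expected {expected:.2f}% per minute, 1-sigma spread about {sigma:.2f} points")
```

With 5,266 events this works out to roughly 1.67% per minute with a spread of about 0.18 percentage points, so bins bouncing around between 1.5 and 2% are about what pure chance would give.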

Well, I was writing this up and then I realized that my analysis of the hours at which the door is opened is flawed – weekdays and weekends are intrinsically different for our lifestyle! Duh! So, I went back and separated out different datasets and here’s the breakdown:

Now things make more sense! On weekdays (blue curve) we have the morning activity around 8 to 10 when we’re heading out for the day, and again another large peak for the evening activity between 6 and 9. On weekends, however, there’s a much steadier pace of in-and-out throughout the day (orange curve). In the original plot of hourly activity, lumping weekdays and weekends together mushed all the data and hid this difference. I think my big surprise is that we’re not more active later on the weekend evenings, but given that we’re parents of a young kid, I guess those days are behind us… And there’s evidence to prove it!
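The split itself is simple. Here's a minimal Python sketch, assuming the events are already parsed into `datetime` objects; the function names are hypothetical and the sample dates are made up. Each group is normalized on its own, so the two curves can be compared even though there are more weekdays than weekend days.

```python
from datetime import datetime


def split_weekday_weekend(timestamps):
    """Return (weekday_hours, weekend_hours) for a list of datetime objects."""
    weekday_hours, weekend_hours = [], []
    for t in timestamps:
        # datetime.weekday(): Monday is 0 ... Saturday is 5, Sunday is 6
        if t.weekday() >= 5:
            weekend_hours.append(t.hour)
        else:
            weekday_hours.append(t.hour)
    return weekday_hours, weekend_hours


def percent_by_hour(hours):
    """Percent of events in each hour 0-23, normalized within the given group."""
    total = len(hours)
    return [100.0 * hours.count(h) / total if total else 0.0 for h in range(24)]


if __name__ == "__main__":
    # Made-up events: a Monday morning, a Saturday afternoon, a Sunday morning.
    events = [
        datetime(2016, 5, 2, 8, 30),
        datetime(2016, 5, 7, 14, 0),
        datetime(2016, 5, 1, 11, 0),
    ]
    weekday_hours, weekend_hours = split_weekday_weekend(events)
    print("weekday:", percent_by_hour(weekday_hours))
    print("weekend:", percent_by_hour(weekend_hours))
```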

So what’s next? I started logging more devices in our ST setup. While this little project was fun and cute, I think there are more serious analyses and applications for this type of data. For example, I think the next project will look for correlations between pairs of motion sensors, or motion sensors and open/close sensors. The idea is to ask questions like “if X door is opened at such-and-such time of day, can I predict if I’ll be going in or out of the room?” “How active are the cats at night?” and “how does household activity depend on who’s at home?”… Needless to say, I think there’s a lot of fun to be had! Any suggestions, questions, or comments are more than welcome – this project is a work-in-progress!


(Mike Maxwell) #2

super cool, I’ve been wanting to do this with motion sensor events…


(Jason "The Enabler" as deemed so by @Smart) #3

Post it on Google and all of your online advertising is going to become increasingly accurately targeted toward your personal likes and lifestyle.

… And now we know how ST makes their money… Only one line of code??? My oh my

@tgauchat did you see this?


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy) #4

Yup… And I’m pretty certain this is a good representation of how any low-priced smart home system will find data mining revenue to be lucrative and irresistible.


(Pizzinini) #5

This is really cool… And shows the power of visualizing SmartThings data - and data in general. This is a very diligent approach.

I often do something way less complicated, but it helps me test the functionality and efficiency of apps using initialstate.com

Example 1 - Nest Manager with external temp sensor: left side is temp in room vs. thermostat without Nest Manager app, right side is with Nest Manager

Example 2 - Smart Bathroom fan app: checks humidity in the bathroom after taking a shower against 24hr rolling average.


(Jesse S) #6

@Mike_Maxwell – Yeah, that’s an excellent idea. I just started logging our motion sensors to see what I can mine from the data. I suspect that since they’re in individual rooms, the finer granularity will reveal some interesting habits about our lives. Honestly though, I’m a bit afraid to really see how much time is spent in the living room with the TV on…

@bamarayne – I think you’re hitting exactly the right point about how this data gets used downstream. I dug a bit deeper into the trends by breaking things up day-by-day and wrote about what I found below. The TLDR version – each day of the week has its own ebb and flow. Tuesdays seem much more regular than Fridays. We’re out later more on Friday/Saturday than Sunday. Sure most of this is semi-obvious, but the potential level of micro-targeting is remarkable. It’s really not that hard to imagine a company that turns this type of data into profits in the same way Google did with web traffic. Oh, and the one line of code is a MATLAB function that takes the spreadsheet in its native form of mixed strings and numbers, then converts it into a matrix array with each column containing a component of the full numeric representation of the date/time stamp. I guess it could also be done with regular expressions or a custom parser, but why reinvent the wheel when it’s already so easy?

@tgauchat The question I’d be asking is what demographic groups emerge and what behaviors are they distinguished by.

@pizzinini Thanks for the positive comments! It feels nice to know others are interested in this type of work!! The screenshots you posted are pretty cool – looks like this type of data is most useful for optimization of your HA setup. One of the interesting challenges I’ve found in tweaking each room’s configuration is that the greater variety of activities that happen in a room, the harder it becomes to get the various apps to function as intended. I guess this is probably a common observation among the folks that frequent this forum, but it always catches me off guard how an “overly smart” configuration can become painfully dumb at times. So far, the best approach I’ve found is to have one or two under-optimized configurations that require only minimal changes to our general behaviors.


And one more analysis to share:

After reflecting on the weekday/weekend breakdown, I decided to get a bit more granular and look at the day-by-day plots. Here again I’m putting the x-axis to be a 24 hour time period, and the y-axis to be the % of open/close events happening during a given hour. Each column corresponds to a day of the week (Sunday through Saturday). In the first row (blue plots) I’m displaying the full 9 month averaged data. Monday, Tuesday and Wednesday have pretty clear morning/evening activity corresponding to leaving for work and general post-work activities. Thursday and Friday start to deviate and generally show more fluctuations. Saturday is pretty quiet before 10 am, but has a regular level of in-and-out until 10 pm. Sunday is distinct – we’re active for the first half of the day, but once we get in after 5 pm, we generally don’t go out much.
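For reference, here's a rough Python sketch of how a day-by-day breakdown like this can be computed: one row per day of the week, 24 hourly percentages per row. I'm assuming each day is normalized separately (so each row adds up to 100), and the function name `day_by_hour_percent` is made up for this example.

```python
from datetime import datetime

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def day_by_hour_percent(timestamps):
    """Return a 7x24 grid of percentages, rows ordered Sunday through Saturday."""
    counts = [[0] * 24 for _ in range(7)]
    for t in timestamps:
        # datetime.weekday() has Monday as 0; shift so Sunday is 0.
        day = (t.weekday() + 1) % 7
        counts[day][t.hour] += 1
    grid = []
    for row in counts:
        total = sum(row)
        # Normalize each day on its own (an assumption about the plots).
        grid.append([100.0 * c / total if total else 0.0 for c in row])
    return grid


if __name__ == "__main__":
    # Made-up events: two on a Sunday, one on a Tuesday.
    events = [
        datetime(2016, 5, 1, 11, 0),
        datetime(2016, 5, 1, 17, 0),
        datetime(2016, 5, 3, 8, 0),
    ]
    grid = day_by_hour_percent(events)
    for name, row in zip(DAY_NAMES, grid):
        busy = {h: round(p, 1) for h, p in enumerate(row) if p > 0}
        print(name, busy)
```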

Continuing the discussion on micro-advertising and data monetization, here’s a bit of what I think you can get from this type of information. Suppose I were at a company that sold scheduling software or self-help books on how-to-manage-your-life, I’d be looking at Tuesdays and Fridays to focus my advertising. On Tuesday, there’s a very regimented and distinct pattern of behavior. On Friday (a workday!), things fluctuate like all hell. Some A-B testing would be in order to see exactly how this works best, but I think the idea would be to have two ads. One would focus on “taking your life to the next level” and the other would be “get your life on track”. The former would be pushed on Tuesdays to play off the psychological sense of reliability where the schedule seems more ‘tight’, and the latter would be pushed on Fridays to take advantage of the seeming irregularity. For the record, I’m not in marketing, so I don’t know if this would actually work, but that’s where the A-B testing would come into play and these ideas could be tested.

When I was looking at the blue data, I decided to go even further by getting one step more granular, and looking at the data corresponding to before and after the birth of our son. The timing works out nicely, and in fact splits the full data set almost exactly in half. The second row of plots with red data corresponds to pre-birth, and the third row of plots with black data to post-birth. Some interesting things jumped out here. For starters, my wife used to work the morning shift on Sundays and Mondays at the local vet hospital. This means she was often up-and-out around 6 am. The red data for Sunday/Monday definitely show these behaviors with peaks around 5 and 6 am. After our son was born, both of our schedules changed to accommodate child care responsibilities, and you can see it by the vanishing of these early-morning peaks in the black (post-birth) data. Interestingly, I’ll also point out that because those early morning shifts were so intense, Tuesday was often her ‘recovery day’, which generally meant sleep. Lots of sleep. And in fact, if we look at the red data on Tuesday, we see there’s a lot less in-and-out during the day. The two very smooth peaks are predominantly a signature of my semi-9-5 work schedule. Now that my wife no longer works those two morning shifts, she’s around and active more on Tuesdays. The analysis highlights this lifestyle change by having less regular post-birth activity on Tuesdays. The interesting takeaway here is that you don’t have to know that we had a son in order to tell that a major life change happened – the data on Sunday, Monday, and Tuesday tell us this story already.
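The before/after comparison boils down to splitting the events at a cutoff date and then checking how much of each half falls in the early-morning window. Here's a minimal Python sketch of that; the cutoff date and the sample events are placeholders I made up (not the real date or data), and the function names are hypothetical.

```python
from datetime import datetime


def split_at(timestamps, cutoff):
    """Split events into those before the cutoff and those at or after it."""
    before = [t for t in timestamps if t < cutoff]
    after = [t for t in timestamps if t >= cutoff]
    return before, after


def share_in_hours(timestamps, start_hour, end_hour):
    """Percent of events with start_hour <= hour < end_hour."""
    if not timestamps:
        return 0.0
    hits = sum(1 for t in timestamps if start_hour <= t.hour < end_hour)
    return 100.0 * hits / len(timestamps)


if __name__ == "__main__":
    cutoff = datetime(2016, 1, 1)  # placeholder cutoff, purely illustrative
    events = [
        datetime(2015, 11, 1, 5, 30),
        datetime(2015, 11, 2, 9, 0),
        datetime(2016, 2, 7, 9, 0),
        datetime(2016, 2, 8, 18, 0),
    ]
    before, after = split_at(events, cutoff)
    print("early-morning share before:", share_in_hours(before, 5, 7))
    print("early-morning share after: ", share_in_hours(after, 5, 7))
```

A drop in the 5 to 7 am share between the two halves is the kind of signature described above for the Sunday/Monday data.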

There are a few other lifestyle changes we’ve had lately with grandparents visiting on a more regular basis, but I’m not convinced I can pick them out yet. I tried narrowing down to just those time points, but the data still fluctuates quite a bit due to low sampling. As time goes on, however, I’m sure we’ll see those trends more clearly.

Anyway, I’m going to continue to play with this data set a bit more and think about other ways to come at it. I have a few more things I’d like to check for…