pstuart
(Patrick Stuart [@pstuart])
August 9, 2014, 10:01pm
1
I’m trying to parse a website. After successfully logging in I get response.data back, but since the page is HTML rather than XML or JSON, I’m struggling with how to use the XmlSlurper object that comes back to search for the divs I want to extract text from.
Anyone have any thoughts or ideas on how to do this?
FYI, the resulting web page is very large, so it doesn’t show up in log.debug. Is it possible there is some upper limit on the size of the returned result?
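For context, the call being described presumably looks something like the sketch below, using the standard SmartThings httpPost(params) { resp -> ... } form; the URI and form fields are placeholders, not the actual site from this thread.
// Minimal sketch of logging in and inspecting the parsed HTML response.
// The endpoint and body fields are hypothetical.
def params = [
    uri: "https://example.com/login.do",
    body: [username: "myUser", password: "myPass"]
]
httpPost(params) { resp ->
    log.debug "status: ${resp.status}"
    // For an HTML page, resp.data comes back already parsed into a tree that
    // can be walked like an XmlSlurper GPathResult (per the posts below),
    // rather than as raw text.
    resp.data.children().each { node ->
        log.debug "top-level node: ${node.name()}"
    }
}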
Dianoga
(Brian Steere)
August 10, 2014, 4:37pm
2
If things are too large it won’t log properly (I believe the limit is 100, but I’m not sure). The data should still be usable though.
Haven’t used xmlslurper so I can’t do much for you there.
You’re a little light on the details, but I’m guessing that since you’re required to sign in, it’s some kind of personal data?
Have you checked whether someone has already implemented something similar using APIfy or a similar tool?
pstuart
(Patrick Stuart [@pstuart])
August 11, 2014, 1:13pm
4
Yeah, most examples assume an API returning data either XML or JSON.
ST parses the HTML into XML and I can walk the content tree…
response.data.children().each { log.debug it.name } returns HTML, and going deeper into each branch I can see all the DIVs.
But when I try to access text() on the elements, I get nothing.
Also, I can’t search on @class, which is really what I want to do. Could ST be truncating the content of a large website result?
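For reference, this is roughly how a class-based div lookup is normally written against an XmlSlurper GPathResult in plain Groovy; per this post, that kind of @class and text() access was not behaving inside the SmartThings sandbox at the time. The class name is made up, and resp.data stands for the parsed response handed back by httpPost.
// Normal plain-Groovy GPath navigation over parsed HTML (illustrative only).
// 'status-box' is a hypothetical class name.
def matches = resp.data.'**'.findAll { node ->
    node.name().equalsIgnoreCase('div') && node.@class.toString() == 'status-box'
}
matches.each { div ->
    log.debug "matched div text: ${div.text()}"
}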
Stupid question, but can you try using a different URL as the source? Perhaps a smaller web page, so you can rule out the size issue?
pstuart
(Patrick Stuart [@pstuart])
August 11, 2014, 10:56pm
6
Yeah, a smaller page like Google works properly.
Another thought… are Google’s tags well formed while the other page’s are not? Malformed markup is pretty common on the web these days.
If you just keep the result as text, can you use find or findAll together with a regular expression to extract what you’re looking for? That assumes you can single out the div and that regular expressions are allowed in SmartThings.
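In plain Groovy, that regex approach would look something like the sketch below; a later post in this thread reports the regex Matcher being security-restricted in the SmartThings sandbox, so treat this as illustration only. The sample markup and pattern are made up.
// Extracting a value from raw HTML with a regular expression (plain Groovy).
def rawHtml = '<div class="temperature">72</div>'
def matcher = (rawHtml =~ /<div class="temperature">([^<]+)<\/div>/)
if (matcher) {
    // matcher[0] is the first match; index 1 is the first capture group
    log.debug "extracted value: ${matcher[0][1]}"
}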
pstuart
(Patrick Stuart [@pstuart])
August 14, 2014, 1:17am
9
Wow, this has been incredibly painful but I got it…
What isn’t working / security restricted…
- Regex Matcher (WTF?)
- Accessing text() on a child node of a large XmlSlurper element
What I had to do was find the exact block (DIV) that I wanted to parse:
doc[0].children[1].children[5].children[1].writeTo(x)
where x is a StringWriter object
Then convert the StringWriter to a string and split it into lines:
def linesAsList = x.toString().minus(" ").split( /\n|\r|\n\r|\r\n/ ).collect { it.replace( "'", '' ) }
Then clean up each line that contains the data I’m looking for and map it to device variables…
That should work, as long as the host doesn’t change the formatting.
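Putting those steps together, the shape of the workaround is roughly the sketch below; the child indexes and the final mapping step are specific to the page being scraped, so treat the details as placeholders.
// Dump one target element to markup, then treat it as plain text.
// 'doc' is the parsed page; the index chain is page-specific.
def x = new StringWriter()
doc[0].children[1].children[5].children[1].writeTo(x)
def linesAsList = x.toString()
    .minus(" ")
    .split(/\n|\r|\n\r|\r\n/)
    .collect { it.replace("'", '') }
linesAsList.each { line ->
    // Pick out the lines that carry the values you want, then map them to
    // device attributes (e.g. via sendEvent) in the device handler.
    log.debug "line: ${line}"
}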
This is a true hack… I’ll reveal what I’m doing shortly, but I’m really upset that it took this much to get simple data from a website / IoT device… But it opens the door to scraping pretty much any site.
Can you share how you did this?
pstuart
(Patrick Stuart [@pstuart])
January 29, 2015, 2:07pm
11
What isn’t clear? I can’t walk you through it step by step, but if you have a specific question, feel free to ask.
Do you have code you can share?
Is it a SmartApp or a device? I want to access an HTTP link and get a status from the source of the HTML page.
pstuart
(Patrick Stuart [@pstuart])
January 29, 2015, 10:45pm
13
https://github.com/pstuart/smartthings/blob/master/Get%20Ubi%20Sensors
Yeah, this ain’t pretty, but it is a proof of concept for how to get data from a web page and parse it.
Someone might want to refine this a bit, but I wanted to get it out there and see if we can force the Ubi guys to make a much more official way to get this data.
Anyway, it’s ugly, it’s not going to work for everyone, and it could break at any time… But it is proof that you can grab almost any data off a website and use it in a device type…
Enjoy…
There’s an entire thread about it with code samples. I thought this was that thread.
I’m trying to use this method to parse a website for the EVL-3 alarm interface.
Where do you get the information for the headers?
uri: 'https://portal.theubi.com/login.do',
headers: [
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'sdch',
    'Host': 'portal.theubi.com',
    'DNT': '1',
    'Origin': 'https://portal.theubi.com/login.jsp',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36'
],
pstuart
(Patrick Stuart [@pstuart])
January 31, 2015, 3:04pm
15
Packet sniffer. But the dev tools in most browsers will show you the headers.
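For completeness, here is roughly how a header map like the one quoted above plugs into a SmartThings request; the body fields are hypothetical placeholders, and any session handling after login depends entirely on the site.
// Sketch of a login POST that reuses the captured headers.
// Copy the real form fields from your browser's dev tools or a packet
// capture, just like the headers themselves.
def params = [
    uri: 'https://portal.theubi.com/login.do',
    headers: [
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36'
    ],
    body: "username=myUser&password=myPass"
]
httpPost(params) { resp ->
    log.debug "login status: ${resp.status}"
    // Any session cookie the site sets would have to be read from the
    // response headers and sent along on the follow-up page request.
}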