Website Parsing via HTTPGET?


(Patrick Stuart [@pstuart]) #1

So I’m trying to parse a website. After successfully logging in I get response.data back, but since the page is HTML rather than XML or JSON, I’m struggling with how to use the XmlSlurper object that comes back to find the divs whose text I’m trying to extract.

Anyone have any thoughts or ideas on how to do this?

FYI, the resulting webpage is very large, so it doesn’t show up in log.debug. Is it possible there is some upper limit on the size of the returned result?


(Brian Steere) #2

If things are too large it won’t log properly (I believe the limit is 100, but I’m not sure). The data should still be usable though.

I haven’t used XmlSlurper, so I can’t do much for you there.


(Amauri Viguera) #3

A little light on the details, but I guess because you’re required to sign in it’s some kind of personal stuff? :smile:

Have you checked if someone is already implementing something similar using APIfy or one of the similar tools?


(Patrick Stuart [@pstuart]) #4

Yeah, most examples assume an API returning either XML or JSON data.

ST parses the HTML into XML and I can walk the content tree…

response.data.children().each { log.debug it.name() } results in HTML, and going deeper into each branch I can see all the DIVs.

But when I try to access text() of the elements, I get nothing.

Also, I can’t search on @class, which is really what I want to do. It’s like ST is truncating the content of a large website result?
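
For comparison, here is a minimal sketch of what a class lookup usually looks like with XmlSlurper in plain Groovy, on a small, well-formed page. The sample markup and the class name `status` are made up for illustration, and this runs outside SmartThings, so it says nothing about what the ST sandbox allows or how it handles a large response.

```groovy
// Groovy 3+ import; on Groovy 2.x the class is groovy.util.XmlSlurper
import groovy.xml.XmlSlurper

// Hypothetical, well-formed sample page (not the real site from this thread)
def html = '''<html><body>
  <div class="header">Welcome</div>
  <div class="status">Armed</div>
</body></html>'''

def doc = new XmlSlurper().parseText(html)

// depthFirst() visits every element; it.@class reads the class attribute
def statusDiv = doc.depthFirst().find { it.name() == 'div' && it.@class == 'status' }
def statusText = statusDiv.text()

println statusText
```

If the same lookup returns nothing on the real page, that points at the page (size or malformed markup) or the platform rather than at the GPath expression.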


(Amauri Viguera) #5

stupid question, but can you try using a different URL as the source? Perhaps a smaller web page so you can rule out the size issue?


(Patrick Stuart [@pstuart]) #6

Yeah, a smaller page like Google works properly.


(Convinced ST will never be unbroken…) #7

Another thought… are Google’s tags well formed and the other page’s not? It’s pretty common on the web nowadays for pages not to be.


(skp19) #8

If you just keep the result as text, can you use the find or findAll function together with a regular expression to extract what you are looking for? This would be assuming that you can single out the div and that we are able to use regular expressions with SmartThings.
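
A rough sketch of that idea in plain Groovy, assuming the page is available as a String and the target div can be singled out by its class. The sample markup and the class name `status` are hypothetical, and whether the SmartThings sandbox permits regular expressions is exactly the open question here.

```groovy
// Hypothetical page source kept as plain text (not the real site)
def html = '''<html><body>
<div class="header">Welcome</div>
<div class="status"> Armed </div>
</body></html>'''

// String.findAll(regex, closure): the closure receives the full match
// followed by each capture group; we keep only the div's inner text.
def matches = html.findAll(/<div class="status">([^<]*)<\/div>/) { full, inner ->
    inner.trim()
}

println matches
```

This only holds up while the markup stays simple: nested tags inside the div or a change in attribute order would need a different pattern.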


(Patrick Stuart [@pstuart]) #9

Wow, this has been incredibly painful but I got it…

What isn’t working / is security restricted:
- Regex matcher (WTF?)
- Accessing the text() element of a child node of a long XmlSlurper element

What I had to do was find the exact block (DIV) that I wanted to parse:
    doc[0].children[1].children[5].children[1].writeTo(x)

where x is a StringWriter object

Then convert the StringWriter to a string and split it into lines (longest line endings first, so \r\n doesn’t produce an extra empty line):

    def linesAsList = x.toString().minus(" ").split( /\r\n|\n\r|\n|\r/ ).collect { it.replace( "'", '' ) }

Then clean up each line that contained the data I was looking for, then map that to device variables…

That should work, as long as the host doesn’t change the formatting.
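
The steps above can be sketched end to end in plain Groovy. This is an illustration under assumptions, not the actual code from this post: the sample page, the `readings` id, and the Temperature/Humidity fields are invented, and the div is selected by path rather than by the positional indexes used on the real response.

```groovy
// Groovy 3+ import; on Groovy 2.x the class is groovy.util.XmlSlurper
import groovy.xml.XmlSlurper

// Hypothetical page standing in for the real device/site output
def page = '''<html><body><div id="readings">
Temperature: 72
<br/>
Humidity: 40
</div></body></html>'''

def doc = new XmlSlurper().parseText(page)
def block = doc.body.div[0]          // the one DIV we care about

// Dump the block's text content into a StringWriter
def writer = new StringWriter()
block.writeTo(writer)

// Split into lines, keep only "key: value" lines, and build a map
def readings = writer.toString()
    .split(/\r\n|\n|\r/)
    .collect { it.trim() }
    .findAll { it.contains(':') }
    .collectEntries { line ->
        def (key, value) = line.split(':', 2)*.trim()
        [(key): value]
    }

println readings
```

The resulting map is what would then be copied into device variables; as noted, it breaks as soon as the host changes its formatting.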

This is a true hack… I’ll reveal what I’m doing shortly, but I’m really upset it took this much to get simple data from a website / IoT device… But it opens the door to scraping pretty much any site.


(Cs Lee) #10

Can you share how you did this?


(Patrick Stuart [@pstuart]) #11

What isn’t clear? I can’t walk you through step by step but if you have a specific question feel free to ask.


(Cs Lee) #12

Do you have code you can share?
Is it a SmartApp or a device? I want to access an HTTP link and get a status from the source of the HTML page.


(Patrick Stuart [@pstuart]) #13

There’s an entire thread about it with code samples. I thought this was that thread.


(Cs Lee) #14

I’m trying to use this method to parse a website for the EVL-3 alarm interface.
Where do you get the information for the header?

    uri: 'https://portal.theubi.com/login.do',
    headers: [
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'sdch',
        'Host': 'portal.theubi.com',
        'DNT': '1',
        'Origin': 'https://portal.theubi.com/login.jsp',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36'
    ],

(Patrick Stuart [@pstuart]) #15

A packet sniffer. But the dev tools in most browsers will show the headers too.