Website Parsing via HTTPGET?

pstuart · August 9, 2014, 10:01pm

So I’m trying to parse a website. After successfully logging in I get the response.data back, but since the page is HTML rather than XML or JSON, I’m struggling with how to use the XmlSlurper object that comes back to find the divs whose text I’m trying to extract.

Anyone have any thoughts or ideas on how to do this?

FYI, the resulting webpage is very large, so it doesn’t show up in log.debug. Is it possible there is some upper limit on the size of the returned result?

Dianoga · August 10, 2014, 4:37pm

If things are too large it won’t log properly (I believe the limit is 100, but I’m not sure). The data should still be usable though.

Haven’t used xmlslurper so I can’t do much for you there.

viguera · August 11, 2014, 4:06am

A little light on the details, but I guess because you’re required to sign in it’s some kind of personal stuff?

Have you checked if someone is already implementing something similar using APIfy or one of the similar tools?

pstuart · August 11, 2014, 1:13pm

Yeah, most examples assume an API returning data either XML or JSON.

ST parses the HTML into XML and I can walk the content tree…

response.data.children().each { log.debug it.name } results in HTML, and going deeper into each branch I can see all the DIVs.

But when I try to access text() of the elements, I get nothing.

Also, can’t search on @class which is really what I want to do. It’s like ST is truncating the content of a large website result?
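
For reference, this is roughly what walking the tree and filtering on @class looks like in plain Groovy, outside the SmartThings sandbox. It is only a sketch: the HTML string is a made-up, well-formed stand-in for the real page, and the import assumes Groovy 3 or later (on Groovy 2.x the class lives in groovy.util and needs no import). It does not show how SmartThings itself behaves with a large response.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x XmlSlurper is in groovy.util

// Hypothetical, well-formed stand-in for the page returned after login
def html = '''<html><body>
  <div class="header">Welcome</div>
  <div class="sensor"><span>Temperature</span><span>72</span></div>
  <div class="sensor"><span>Humidity</span><span>40</span></div>
</body></html>'''

def doc = new XmlSlurper().parseText(html)

// '**' walks the whole tree depth-first; @class reads the class attribute
def sensors = doc.'**'.findAll { it.name() == 'div' && it.@class == 'sensor' }

// text() returns the text content of each matched element
def readings = sensors.collect { it.span[0].text() + '=' + it.span[1].text() }
println readings
```

If this works on a desktop Groovy install but the same calls return nothing in the IDE, that points at the platform rather than at the GPath expressions.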

viguera · August 11, 2014, 1:36pm

Stupid question, but can you try using a different URL as the source? Perhaps a smaller web page, so you can rule out the size issue?

pstuart · August 11, 2014, 10:56pm

Yeah, a smaller page like Google works properly.

scottinpollock · August 12, 2014, 12:14am

Another thought… are Google’s tags well formed and the other page’s not? Pretty common on the web nowadays to not be.

eparkerjr · August 13, 2014, 8:20am

If you just keep the result as text, can you use the find or findAll function together with a regular expression to extract what you are looking for? This would be assuming that you can single out the div and that we are able to use regular expressions with SmartThings.
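
A minimal sketch of that idea in plain Groovy, assuming the page is available as a String and that the target divs can be singled out by their class attribute. The sample markup and the class name "sensor" are hypothetical, and whether the SmartThings sandbox permits regular expressions is a separate question.

```groovy
// Hypothetical page content kept as plain text
def html = '''<html><body>
<div class="header">Welcome</div>
<div class="sensor">Temperature: 72</div>
<div class="sensor">Humidity: 40</div>
</body></html>'''

// String.findAll(regex, closure) passes the full match first, then each capture group.
// (?s) lets '.' span line breaks; '.*?' is non-greedy so each div is matched separately.
def values = html.findAll(/(?s)<div class="sensor">(.*?)<\/div>/) { whole, inner ->
    inner.trim()
}
println values
```

A pattern like this is brittle against nested divs or attribute reordering, so it only suits pages whose markup is simple and stable.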

pstuart · August 14, 2014, 1:17am

Wow, this has been incredibly painful but I got it…

What isn’t working / security restricted…
-Regex matcher (WTF?)
-Accessing the text() element of a childnode of a long xmlslurper element

What I had to do was find the exact block (DIV) that I wanted to parse:
    doc[0].children[1].children[5].children[1].writeTo(x)

where x is a StringWriter object

Then convert the stringwriter to a string, split it into lines

    def linesAsList = x.toString().minus(" ").split( /\n|\r|\n\r|\r\n/ ).collect { it.replace( "'", '' ) }

Then clean up each line that contained the data I was looking for, then map that to device variables…

That should work, as long as the host doesn’t change the formatting.
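
The steps above can be sketched end to end in plain Groovy as follows. This is an illustration under assumptions, not the actual device code: the HTML is a made-up stand-in, the block is reached with a GPath expression (doc.body.div) instead of the index path from the real page, the "Name: value" line format is invented, and a plain map stands in for the device variables. The import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x XmlSlurper is in groovy.util

// Hypothetical stand-in for the block of the page that holds the data
def html = '''<html><body><div id="status">
Temperature: 72
Humidity: 40
</div></body></html>'''

def doc = new XmlSlurper().parseText(html)

// Write the block's content into a StringWriter instead of calling text()
def x = new StringWriter()
doc.body.div.writeTo(x)

// Split into lines, trim each one, and keep only the lines that carry data
def lines = x.toString().split(/\r\n|\n|\r/).collect { it.trim() }.findAll { it.contains(':') }

// Map each "Name: value" line to an entry (stand-in for device variables)
def readings = lines.collectEntries { line ->
    def parts = line.split(':', 2)
    [(parts[0].trim()): parts[1].trim()]
}
println readings
```

Like the original approach, this depends on the host keeping its line layout unchanged.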

This is a true hack… I’ll reveal what I’m doing shortly, but really upset it took this much to get simple data from a website / IoT device… But it opens the door to scrape pretty much any site.

email_cslee · January 29, 2015, 8:21am

Can you share how you did this?

pstuart · January 29, 2015, 2:07pm

What isn’t clear? I can’t walk you through step by step but if you have a specific question feel free to ask.

email_cslee · January 29, 2015, 10:12pm

Do you have code you can share?
Is it a smartapp or a device? I want to access a http link and get a status from the source of the html page.

pstuart · January 29, 2015, 10:45pm

There’s an entire thread about it with code samples. I thought this was that thread.

email_cslee · January 31, 2015, 12:54am

I’m trying to use this method to parse a website for the EVL-3 alarm interface.
Where do you get the information for the header?

    uri: 'https://portal.theubi.com/login.do',
    headers: [
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'sdch',
        'Host': 'portal.theubi.com',
        'DNT': '1',
        'Origin': 'https://portal.theubi.com/login.jsp',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36'
    ],

pstuart · January 31, 2015, 3:04pm

Packet sniffer. But the dev tools in most browsers will show the headers too.
