
Setting up import.io crawler with xpath or regexp

I am currently trying to set up a web crawler to extract data from real estate websites. On these sites, certain information does not appear in the same place on every page, so I need to extract text elements based on the phrases they contain rather than on their position. Here are some examples of such pages:

http://www.zillow.com/homedetails/2630-Hazy-Creek-Dr-Houston-TX-77084/28388488_zpid/

http://www.zillow.com/homedetails/16514-Park-Firth-Dr-Houston-TX-77084/28357799_zpid/

Notice how certain info, such as the MLS #, appears in different spots. When I extract the XPath from one of these entries, I get //*[@id="yui_3_15_0_1_1435013689406_3296"], and since I'm not too familiar with XPath, I don't know how to alter it to look for a phrase (I've certainly tried, but it never works out). Regexp seems like a promising tool, but when I use the pattern ^MLS, which should match elements beginning with "MLS", it simply doesn't work. I know there must be a straightforward way to do this, but this is my first time using this service, so I'm not too familiar with it yet. Any advice would be much appreciated.

user2480757 asked Jun 22 '15 23:06



1 Answer

In import.io, regex doesn't let you extract data; it only cleans or modifies text that has already been extracted.

You need to create an XPath to extract the data you want. Here is one as an example:

//*[@role="main"]//li[contains(text(), "MLS ")]

Explanation: this looks for the main section of the page and then searches for a <li> that contains the text "MLS ". It will extract something like "MLS #: 66521347".

You can now set the column type to "number" to get only the number (you could also do this with regex; that's precisely the kind of thing it is for).
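To see what this XPath is asking for outside of import.io, here is a minimal Python sketch. It is not import.io's implementation: the standard library's ElementTree has no contains(), so the text test is emulated in plain Python. The PAGE markup is a hypothetical, well-formed stand-in (the real Zillow markup differs), and find_mls_items and to_number are names made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a listing page (assumption: not real Zillow markup).
PAGE = """
<html><body>
  <div role="main">
    <ul>
      <li>Lot: 7,200 sqft</li>
      <li>MLS #: 66521347</li>
      <li>Built in 1983</li>
    </ul>
  </div>
</body></html>
"""


def find_mls_items(markup):
    """Emulate //*[@role="main"]//li[contains(text(), "MLS ")]."""
    root = ET.fromstring(markup.strip())
    # Step 1: locate the element marked role="main".
    main = root.find(".//*[@role='main']")
    if main is None:
        return []
    # Step 2: keep every descendant <li> whose leading text contains "MLS ".
    return [li.text.strip() for li in main.iter("li")
            if li.text and "MLS " in li.text]


def to_number(text):
    """Rough equivalent of a 'number' column: keep the first run of digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


items = find_mls_items(PAGE)
print(items)
print([to_number(item) for item in items])
```

Because the match is on the element's text rather than on an auto-generated id such as yui_3_15_0_1_1435013689406_3296, it does not depend on where the MLS entry sits in the list.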

EDIT: Even though that XPath is correct, it doesn't return the data in import.io. There is another way to do it: use an XPath to bring in all the text in that section, and then use regex to select the MLS number.

XPath to use:

//*[@role="main"]/section[@class="zsg-content-section "][1]

Regex to use:

MLS #: (\d+)
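
The regex step can be tried on its own with Python's re module. This is only a sketch: SECTION_TEXT is a hypothetical string standing in for whatever the section-level XPath returns, and extract_mls is a name invented here. It also shows why an anchored pattern like ^MLS finds nothing when the extracted text does not begin with "MLS"; whether import.io applies patterns in exactly the same way is an assumption.

```python
import re

# Hypothetical text standing in for the section-level XPath result (assumption).
SECTION_TEXT = ("Facts Lot: 7,200 sqft Single Family Built in 1983 "
                "MLS #: 66521347 Cooling: Central")

# Same pattern as above: the capture group holds only the digits.
MLS_PATTERN = re.compile(r"MLS #: (\d+)")


def extract_mls(text):
    """Return the MLS number as a string, or None if the text has none."""
    match = MLS_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_mls(SECTION_TEXT))
print(extract_mls("no listing number here"))

# ^ anchors the match to the start of the string, so ^MLS fails here:
print(re.search(r"^MLS", SECTION_TEXT))
```

Searching without the ^ anchor lets the pattern match wherever the MLS entry falls in the section text.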
ignacioelola answered Sep 25 '22 06:09
