As an intern on an economic research team, I was given the task of finding a way to automatically collect specific data from a real estate ad website, using R.
I assume the relevant packages are XML and RCurl, but my understanding of how they work is very limited.
Here is the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000
Ideally, I'd like to construct my database so that each row corresponds to an ad.
Here is the detail of an ad: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s
My variables are: the price ("Prix"), the city ("Ville"), the surface ("Surface"), the "GES", the "Classe énergie" and the number of rooms ("Pièces"), as well as the number of pictures shown in the ad. I would also like to export the ad text into a character vector, over which I would perform a text mining analysis later on.
I'm looking for any help, link to a tutorial, or how-to that would give me a lead on the path to follow.
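To make the target concrete, here is a rough sketch of the structure I have in mind (the column names are just placeholders for the labels on the site):

# Sketch of the target structure: one row per ad, one column per
# variable. Column names are placeholders, not the site's exact labels.
ads = data.frame(
  Prix           = numeric(0),   # price in euros
  Ville          = character(0), # city
  Surface        = numeric(0),   # surface in m2
  GES            = character(0), # greenhouse gas rating
  Classe_energie = character(0), # energy class
  Pieces         = integer(0),   # number of rooms
  n_photos       = integer(0)    # number of pictures in the ad
)

# The ad texts would go into a separate character vector,
# one element per ad, for the text mining step.
ad_texts = character(0)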
You can use the XML package in R to scrape this data. Here is a piece of code that should help.
# DEFINE UTILITY FUNCTIONS

# Function to get links to ads on a given results page
get_ad_links = function(page){
  require(XML)
  # construct url for the given page of results
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  doc = htmlTreeParse(url, useInternalNodes = T)
  # extract links to individual ads on the page
  xp_exp = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)
}
# Function to get ad details given an ad URL
get_ad_details = function(ad_url){
  require(XML)
  # parse the ad page into an html tree
  doc = htmlTreeParse(ad_url, useInternalNodes = T)
  # extract labels and values using xpath expressions
  labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
  values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
  values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
  values  = c(values1, values2)
  # convert to a one-row data frame with labels as column names
  mydf = as.data.frame(t(values))
  names(mydf) = labels
  return(mydf)
}
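The question also asks for the number of pictures shown in each ad. Here is a rough, unverified sketch of how you might count them; the "//img" XPath is a guess and would need checking against the page's actual markup:

# Sketch: count pictures in an ad by counting <img> nodes.
# NOTE: "//img" is a guess; inspect the real HTML and narrow the
# expression to the ad's photo gallery before relying on this.
get_ad_photo_count = function(ad_url){
  require(XML)
  doc = htmlTreeParse(ad_url, useInternalNodes = T)
  imgs = xpathSApply(doc, "//img", xmlGetAttr, "src")
  return(length(imgs))
}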
Here is how you would use these functions to extract information into a data frame.
# grab ad links from page 1
ad_links = get_ad_links(page = 1)
# grab ad details for first 5 links from page 1
require(plyr)
ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')
This returns the following output:
Prix :     Ville :      Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :  GES :
469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)  <NA>
469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)  <NA>
140 000 €  59000 Lille  <NA>                     Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
140 000 €  59000 Lille  <NA>                     Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
170 000 €  59000 Lille  <NA>                     Appartement     <NA>      50 m2      D (de 151 à 230)  D (de 21 à 35)
You can easily use the apply family of functions to loop over multiple pages and get the details of all ads. Two things to be mindful of: one is the legality of scraping the website; the other is to use Sys.sleep in your looping function so that the servers are not bombarded with requests.
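For example, a minimal sketch of such a loop (the page count of 3 and the 5-second pause are arbitrary choices):

# Sketch: collect ad details from the first few result pages,
# pausing between pages so as not to hammer the server.
require(plyr)
all_ads = ldply(1:3, function(page){
  ad_links = get_ad_links(page)
  details  = ldply(ad_links, get_ad_details)
  Sys.sleep(5)  # polite pause between page requests
  return(details)
})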
Let me know how this works.
That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.
Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit extracting the useful pieces of data from it? (You'll probably need to use XPath for that.)
Take a look at the web-scraping example on Rosetta Code and browse these SO questions for more information.
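For instance, here is a bare-bones sketch of those two steps using the XML package (the "//a" expression just grabs all link text, as a placeholder):

# Minimal sketch: retrieve a page, then query it with XPath.
library(XML)
url = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
doc = htmlParse(url)                       # step 1: fetch and parse the page
links = xpathSApply(doc, "//a", xmlValue)  # step 2: extract nodes via XPath
head(links)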