I'm testing some web scraping scripts in R. I've read many tutorials and docs and tried different approaches, but no success so far.
The URL I'm trying to scrape is this one. It contains public government data, and there are no statements against web scrapers. It's in Portuguese, but I believe that won't be a big problem.
It shows a search form with several fields. My test was searching for data from a particular state ("RJ"; the field is "UF") and a city ("Rio de Janeiro", in the field "MUNICIPIO"). Clicking "Pesquisar" (Search) returns the results page.
Using Firebug, I found the URL it calls (using the parameters above) is:
http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
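Note that the %3A in each parameter name is just a URL-encoded colon, so the real JSF field names have the component:input form. A quick check with base R:
URLdecode("estadoDecorate%3AestadoSelect")
## [1] "estadoDecorate:estadoSelect"
URLdecode("municipioDecorate%3AmunicipioSelect")
## [1] "municipioDecorate:municipioSelect"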
The site uses a jsessionid, as can be seen using the following:
library(rvest)
library(httr)
url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
cookies(url)
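## the cookie listing should include a JSESSIONID entry for the site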
Knowing it uses a jsessionid, I used cookies(url) to grab that value and inserted it into a new URL like this:
url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
html_text(url)
Well, the output doesn't have the data. In fact, it contains an error message which, translated into English, basically says the session has expired.
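My guess at the cause: read_html() downloads the page over its own bare connection, so the JSESSIONID cookie that httr received is never sent back, and the server sees an unknown session. A minimal sketch of what I mean, assuming httr keeps reusing one handle (and cookie jar) per host:
# fetch through httr so the JSESSIONID cookie from the earlier GET() is reused,
# then hand the text to read_html() for parsing
res <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")
doc <- read_html(content(res, as="text"))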
I assume it's a basic mistake, but I've looked all around and couldn't find a way to overcome this.
This combination worked for me:
library(curl)
library(xml2)
library(httr)
library(rvest)
library(stringi)
# warm up the curl handle
start <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")
# get the cookies
ck <- handle_cookies(handle_find("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")$handle)
# make the POST request
res <- POST("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=" %s+% ck[1,]$value,
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0"),
accept("*/*"),
encode="form",
multipart=FALSE, # this generates a warning but seems to be necessary
add_headers(Referer="http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"),
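# the field names below are the URL-decoded versions of the parameters
# captured with Firebug (%3A decodes to ":")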
body=list(`buscaForm`="buscaForm",
`codEntidadeDecorate:codEntidadeInput`="",
`noEntidadeDecorate:noEntidadeInput`="",
`descEnderecoDecorate:descEnderecoInput`="",
`estadoDecorate:estadoSelect`=33,
`municipioDecorate:municipioSelect`=3304557,
`bairroDecorate:bairroInput`="",
`pesquisar.x`=50,
`pesquisar.y`=15,
`javax.faces.ViewState`="j_id1"))
doc <- read_html(content(res, as="text"))
html_nodes(doc, "table")
## {xml_nodeset (5)}
## [1] <table border="0" cellpadding="0" cellspacing="0" class="rich-tabpanel " id="j_id17" sty ...
## [2] <table border="0" cellpadding="0" cellspacing="0">\n <tr>\n <td>\n <img alt="" ...
## [3] <table border="0" cellpadding="0" cellspacing="0" id="j_id18_shifted" onclick="if (RichF ...
## [4] <table border="0" cellpadding="0" cellspacing="0" style="height: 100%; width: 100%;">\n ...
## [5] <table border="0" cellpadding="10" cellspacing="0" class="dr-tbpnl-cntnt-pstn rich-tabpa ...
I used Burp Suite to inspect what was going on and did a quick test at the command line with the output from "Copy as cURL", adding --verbose so I could validate what was being sent and received. I then mimicked those curl parameters.
By starting at the bare search page, the cookies for the session id and the BIG-IP load balancer are already warmed up (i.e. they will be sent with every subsequent request, so you don't have to manage them yourself), BUT you still need to put the jsessionid into the URL path, so we have to retrieve it from the cookie jar and fill it in.
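Here is a minimal sketch of that cookie dance as a reusable function; build_search_url() is a hypothetical helper (not from any package), and it looks the session cookie up by name rather than by position:
# hypothetical helper illustrating the steps above
build_search_url <- function(base="http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam") {
  invisible(GET(base))                        # warm up the handle + cookie jar
  ck <- handle_cookies(handle_find(base)$handle)
  jsid <- ck$value[ck$name == "JSESSIONID"]   # safer than ck[1,] if the order changes
  paste0(base, ";jsessionid=", jsid)          # fill the session id into the URL path
}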