
R web scraper with jsessionid

I'm testing some web scraping scripts in R. I've read many tutorials and docs and tried different things, but have had no success so far.

The URL I'm trying to scrape is this one. It has public, government data, and no statements against web scrapers. It's in Portuguese, but I believe it won't be a big problem.

It shows a search form, with several fields. My test was searching for data from a particular state ("RJ", in this case the field is "UF"), and city ("Rio de Janeiro", in the field "MUNICIPIO"). By clicking "Pesquisar" (Search), it shows the following output:

(screenshot of the search results)

Using Firebug, I found the URL it calls (using the parameters above) is:

http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
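
To make the form fields in that URL easier to read, the query string can be split and percent-decoded (`%3A` is an encoded `:`). This is only a minimal sketch in base R: `decode_query` is a hypothetical helper name, and the query string is a shortened copy of the one above.

```r
# Hypothetical helper: turn a raw query string into a named character vector.
decode_query <- function(query) {
  # Split into "name=value" pieces, then split each piece on "=".
  pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  # A field with an empty value yields only its name, so default to "".
  values <- vapply(pairs, function(p) {
    if (length(p) > 1) utils::URLdecode(p[2]) else ""
  }, character(1))
  names(values) <- vapply(pairs, function(p) utils::URLdecode(p[1]), character(1))
  values
}

# Shortened copy of the query string captured with Firebug.
query <- paste0(
  "buscaForm=buscaForm",
  "&codEntidadeDecorate%3AcodEntidadeInput=",
  "&estadoDecorate%3AestadoSelect=33",
  "&municipioDecorate%3AmunicipioSelect=3304557",
  "&javax.faces.ViewState=j_id10"
)

fields <- decode_query(query)
print(fields)
```

The decoded names (for example `estadoDecorate:estadoSelect`) are the field names a form submission would need to carry.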

The site uses a jsessionid, as can be seen using the following:

library(rvest)
library(httr)
url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
cookies(url)

Knowing it uses a jsessionid, I used cookies(url) to read its value and inserted it into a new URL like this:

url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
html_text(url)

Well, the output doesn't have the data. In fact, it has an error message. Translated into English, it basically says the session has expired.

I assume it is a basic mistake, but I looked all around and couldn't find a way to overcome this.

Ricardo Costa asked Oct 18 '22 19:10



1 Answer

This combination worked for me:

library(curl)
library(xml2)
library(httr)
library(rvest)
library(stringi)

# warm up the curl handle
start <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")

# get the cookies
ck <- handle_cookies(handle_find("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")$handle)

# make the POST request
res <- POST("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=" %s+% ck[1,]$value,
            user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0"),
            accept("*/*"),
            encode="form",
            multipart=FALSE, # this generates a warning but seems to be necessary
            add_headers(Referer="http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"),
            body=list(`buscaForm`="buscaForm",
                      `codEntidadeDecorate:codEntidadeInput`="",
                      `noEntidadeDecorate:noEntidadeInput`="",
                      `descEnderecoDecorate:descEnderecoInput`="",
                      `estadoDecorate:estadoSelect`=33,
                      `municipioDecorate:municipioSelect`=3304557,
                      `bairroDecorate:bairroInput`="",
                      `pesquisar.x`=50,
                      `pesquisar.y`=15,
                      `javax.faces.ViewState`="j_id1"))

doc <- read_html(content(res, as="text"))

html_nodes(doc, "table")
## {xml_nodeset (5)}
## [1] <table border="0" cellpadding="0" cellspacing="0" class="rich-tabpanel " id="j_id17" sty ...
## [2] <table border="0" cellpadding="0" cellspacing="0">\n  <tr>\n    <td>\n      <img alt=""  ...
## [3] <table border="0" cellpadding="0" cellspacing="0" id="j_id18_shifted" onclick="if (RichF ...
## [4] <table border="0" cellpadding="0" cellspacing="0" style="height: 100%; width: 100%;">\n  ...
## [5] <table border="0" cellpadding="10" cellspacing="0" class="dr-tbpnl-cntnt-pstn rich-tabpa ...

I used BurpSuite to inspect what was going on, then did a quick test at the command line with the output from "Copy as cURL", adding --verbose so I could validate what was being sent and received. I then mimicked the curl parameters in the POST call.

Because the first request starts at the bare search page, the cookies for the session id and the bigip server are already warmed up on the handle (i.e. they will be sent with every request, so you don't have to mess with them). BUT you still need to put the session id into the URL path, so we have to retrieve the cookies and then fill the value in.
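
The code above takes the session id from the first row of the cookie table. If the cookie order ever differs, a slightly more defensive variant is to look the cookie up by name. The following is only a sketch under assumptions: `build_session_url` is a hypothetical helper, the cookie is assumed to be named JSESSIONID, the table is assumed to have `name` and `value` columns like the one `handle_cookies()` returns, and the mock table (including the bigip cookie name and value) is made up so the example runs offline. The session id is the one from the question.

```r
# Hypothetical helper: append ";jsessionid=<id>" to a base URL,
# selecting the session cookie by name instead of by row position.
build_session_url <- function(base_url, cookie_df) {
  idx <- which(toupper(cookie_df$name) == "JSESSIONID")
  if (length(idx) == 0) {
    stop("no JSESSIONID cookie found")
  }
  paste0(base_url, ";jsessionid=", cookie_df$value[idx[1]])
}

# Mock cookie table (assumed shape; both cookie rows are illustrative only).
mock_cookies <- data.frame(
  name  = c("BIGipServerExample", "JSESSIONID"),
  value = c("0000000000.00000.0000", "008142964577DBEC622E6D0C8AF2F034"),
  stringsAsFactors = FALSE
)

session_url <- build_session_url(
  "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam",
  mock_cookies
)
print(session_url)
```

In the real script, `ck` from `handle_cookies()` would take the place of `mock_cookies`, and the result would be passed to `POST()` as the URL.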

hrbrmstr answered Oct 21 '22 15:10
