
Getting information with web scraping from multiple screen web page

Tags:

r

rvest

I am trying to get some information about enterprises from the Internet. Most of the information is located on this page: http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul

On this page I have to click on the tab Busqueda de Companias, and then the interesting part starts. Clicking it brings up the next screen, where I have to select the option Nombre and type in a company name. For example, I enter the string PROAÑO & ASOCIADOS CIA. LTDA.

Then I have to click on Buscar, which brings up the next screen.

This screen shows the information for the enterprise. Then I have to click on the tab Informacion Estados Financieros to reach the next screen.

On this final screen I have to click on the tab Estado Situacion, and I get the information for the enterprise in the columns Codigo de la cuenta contable, Nombre de la cuenta contable and Valor. I would like to save that information in a data frame. The most complex part for me is selecting the element Nombre, inserting a string, clicking Buscar, and then clicking through until I reach the tab Informacion Estados Financieros. I have tried using html_session and html_form from the rvest package, but the elements come back empty.

Could you help me with some steps to solve this problem?

Duck asked May 03 '16 11:05


2 Answers

RSelenium Coded Example

Here is a self-contained code example, using the web-site referenced in the question.

Note: Please do not run this code against the site as-is.

Why? Having 1k Stack users hit the web-site amounts to a DDoS attack.


Prerequisites

The code below installs RSelenium. Before running it, you need to:

  1. Install Firefox
  2. Add the Selenium IDE plugin: https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
  3. Install RStudio [recommended]
  4. Create a project and open the code file below

The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] through to the final page, where the information you are interested in is located.

Useful References:

If you are interested in using RSelenium I strongly recommend you read the following references, thanks go to John Harrison for developing the RSelenium package.

  • RSelenium Basics

http://rpubs.com/johndharrison/12843

  • RSelenium Headless Browsing

http://rpubs.com/johndharrison/RSelenium-headless

  • RSelenium Vignette

https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html

Code Example


# We want to make this as easy as possible to use
# So we need to install required packages for the user...
#
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(XML)) install.packages("XML")
if (!require(RJSONIO)) install.packages("RJSONIO")
if (!require(stringr)) install.packages("stringr")
# Data
#
mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"

# StartServer

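# We assume RSelenium is not set up, so we check whether the Selenium
# server is available; if not, checkForServer() downloads it.
# Note: checkForServer() and startServer() belong to the RSelenium releases
# current when this was written; later releases replace them with rsDriver().
checkForServer()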

# OK. now we start the server
RSelenium::startServer()
remDr <- RSelenium::remoteDriver$new()

# We assume the user has installed Firefox and the Selenium IDE
# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
#

# OK, we open Firefox
remDr$open(silent = TRUE) # Open up a Firefox window...

# Now we open the browser and required URL...
# This is the page that matters...
remDr$navigate(businessPage)

# First things first: on the first page, let's get the elements for the radio
# button, the name field, and the search button. We need all three.
#
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
nameElement <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")

# Optional: we can highlight the radio elements returned
# lapply(radioButton, function(x){x$highlightElement()})
# Optional: we can highlight the nameElement returned
# lapply(nameElement, function(x){x$highlightElement()})
# Optional: we can highlight the searchButton returned
# lapply(searchButton, function(x){x$highlightElement()})

# Now we can select and press the third radio button
radioButton[[3]]$clickElement()
# We fill in the required name...
nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))
# This is subtle but required the page triggers a drop down list, so rather than
# hitting the searchButton, we first select, and hit enter in the drop down menu...
selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
selectElement[[1]]$clickElement()
# OK, now we can click the search button, which will cause the next page to open
searchButton[[1]]$clickElement()

# New Page opens...
#
# Ok, so now we first pull the list of buttons...
finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
# Now we can press the required button to open the page we want to get to...
finPageButton[[9]]$clickElement()

# We are now on the required page.

We are now on the target page.

Extracting the table values...

The next step is to extract the table values. To do this, we pull the elements matching the .z-listitem CSS selector and check that they contain the rows of data. They do, so we can extract the returned values and populate either a list or a data frame.

# Ok, now we need to extract the table, we identify and pull out the 
# '.z-listitem' and assign to modalWindow
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")

# Now we can extract the rows from modalWindow. Note that each row is
# returned as a single string of text, so we split it into three fields on
# the newline marker '\n'

lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
lineText

Here is the result:

> lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
> lineText
[[1]]
[1] "10"                                                                                                                                      
[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
[3] "0.00"     
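
The question asks for the values in a data frame, so here is a minimal sketch of that last step. It assumes every row arrives as one string of the form "code\nname\nvalue", as in the output above, and that the value column parses with as.numeric(). The helper rows_to_df() and its column names are hypothetical and not part of RSelenium, and the second example row is made up for illustration.

```r
# Hypothetical helper: turn the raw text of each table row into one row of
# a data frame. Assumes each string looks like "code\nname\nvalue".
rows_to_df <- function(row_text) {
  parts <- strsplit(row_text, "\n", fixed = TRUE)
  # Keep only rows that split into exactly three fields
  parts <- parts[lengths(parts) == 3]
  data.frame(
    codigo = vapply(parts, `[`, character(1), 1),
    nombre = vapply(parts, `[`, character(1), 2),
    valor  = as.numeric(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Example input standing in for the text returned by getElementText().
# The first row is the one shown above; the second row is made up.
example_rows <- c(
  "10\nOPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES\n0.00",
  "20\nFILA DE EJEMPLO\n1234.50"
)

balance <- rows_to_df(example_rows)
str(balance)
```

With a live session, the input could probably be collected with something like vapply(modalWindow, function(el) el$getElementText()[[1]], character(1)), assuming getElementText() returns a list holding one string per element.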

Dealing with Hidden Data.

Selenium WebDriver, and thus RSelenium, only interacts with visible elements of a web page. If we try to read the entire table, we only get back the rows that are currently visible.

We can work around this by scrolling to the bottom of the table. The scroll action forces the table to populate, and we can then extract the complete table.

# Select the .z-listbox-body

modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")

# Now we tell the window we want to scroll to the bottom of the table
# This triggers the table to populate all the rows

modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")

# Now we can extract the complete table
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")

lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
lineText

What the code does

The code example above is meant to be self-contained: it should install everything you need, including the required packages. Once the dependent R packages are installed, the code calls checkForServer(), which downloads and installs the Selenium server if it is not already present. This may take some time.
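
As a side note on the package step, the sketch below shows one way to check which packages are missing before installing anything. The helper missing_packages() is hypothetical and not part of the original code. It relies only on base R's requireNamespace(), and the install call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: report which packages are not yet installed,
# without attaching anything to the search path.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("RSelenium", "XML", "RJSONIO", "stringr")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Unlike the `if (!require(pkg)) install.packages("pkg")` pattern, this keeps the checking and the installing separate, so a mistyped package name shows up in the message before anything is downloaded.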

I recommend stepping through the code line by line, as I have not incorporated any delays (in production you would want to). Note also that I have optimised for a modicum of clarity [from my perspective] rather than for speed.
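
For the delays mentioned above, one possible approach is a small polling helper that retries a lookup until it returns something or a timeout expires. This is only a sketch: wait_for() and fake_find() are hypothetical names, and the demo uses a fake finder so the block runs without a browser.

```r
# Hypothetical helper: call `find` repeatedly until it returns a non-empty
# result, or stop with an error once `timeout` seconds have passed.
wait_for <- function(find, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error from the lookup the same as "nothing found yet"
    result <- tryCatch(find(), error = function(e) NULL)
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for element")
    Sys.sleep(interval)
  }
}

# Demo with a fake finder that only succeeds on the third call
calls <- 0
fake_find <- function() {
  calls <<- calls + 1
  if (calls < 3) list() else list("element")
}

found <- wait_for(fake_find, timeout = 5, interval = 0.1)
```

In the scraping code this might be used as wait_for(function() remDr$findElements(using = 'css selector', ".z-listitem")), on the assumption that findElements() returns an empty list while the page is still loading.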

The code was shown to work on:

  • Mac OS X 10.11.5
  • RStudio 0.99.893
  • R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"

Technophobe01 answered Oct 11 '22 11:10


Check out RSelenium

  • First, install RSelenium and use the above linked vignette to get familiar with the basics

  • Then see this webinar on using RSelenium, which goes through some detailed scraping step-by-step and is quite easy to follow: http://johndharrison.blogspot.hk/2014/05/orange-county-r-users-group-oc-rug.html

vijucat answered Oct 11 '22 11:10