I am trying to get some information about enterprises from the Internet. Most of the information is located in this page: http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul, the page looks like this:
In this page I have to click on the tab Busqueda de Companias
and then the interesting side starts. When I click I get the next screen:
In this page I have to set the option Nombre
and then I have to insert a string with a name. For example I will add the string PROAÑO & ASOCIADOS CIA. LTDA.
and I will get the next screen:
Then, I have to click on Buscar
and I will get the next screen:
In this screen I have the information for this enterprise. Then, I have to click on the tab Informacion Estados Financieros
and I will get the next screen:
In this finally screen I have to click on the tab Estado Situacion
and I will get the information from the enterprise in the columns Codigo de la cuenta contable
, Nombre de la cuenta contable
and Valor
. I would like to get that information saved in a dataframe. Most of the complex side I found began when I have to set the element Nombre
, insert a string, then Buscar
and click until find the tab Informacion Estados Financieros
. I have tried using html_session
and html_form
from rvest
package but the elements are empty.
Could you help me with some steps to solve this problem?
Here is a self-contained code example, using the web-site referenced in the question.
Why? Having 1k Stack users hit the web-site is a DDOS attack.
##Introduction Prerequisites
The code below will install RSelenium, before running the code you need to:
The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] through to the final page where the information you are interested in is...
If you are interested in using RSelenium I strongly recommend you read the following references, thanks go to John Harrison for developing the RSelenium package.
- RSelenium Basics
http://rpubs.com/johndharrison/12843
- RSelenium Headless Browsing
http://rpubs.com/johndharrison/RSelenium-headless
- RSelenium Vignette
https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
# We want to make this as easy as possible to use
# So we need to install required packages for the user...
#
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(XML)) install.packages("XML")
if (!require(RJSONIO)) install.packages("RSJONIO")
if (!require(stringr)) install.packages("stringr")
# Data
#
mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"
# StartServer
# We assume RSelenium is not setup, so we check if the RSelenium
# server is available, if not we install RSelenium server.
checkForServer()
# OK. now we start the server
RSelenium::startServer()
remDr <- RSelenium::remoteDriver$new()
# We assume the user has installed Firefox and the Selenium IDE
# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
#
# Ok we open firefix
remDr$open(silent = T) # Open up a firefox window...
# Now we open the browser and required URL...
# This is the page that matters...
remDr$navigate(businessPage)
# First things first on the first page, lets get the id's for the radio_button,
# name Element, and button. We need all three.
#
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
nameElement <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")
# Optional: we can highlight the radio elements returned
# lapply(radioButton, function(x){x$highlightElement()})
# Optional: we can highlight the nameElement returned
# lapply(nameElement, function(x){x$highlightElement()})
# Optional: we can highlight the searchButton returned
# lapply(searchButton, function(x){x$highlightElement()})
# Now we can select and press the third radio button
radioButton[[3]]$clickElement()
# We fill in the required name...
nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))
# This is subtle but required the page triggers a drop down list, so rather than
# hitting the searchButton, we first select, and hit enter in the drop down menu...
selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
selectElement[[1]]$clickElement()
# OK, now we can click the search button, which will cause the next page to open
searchButton[[1]]$clickElement()
# New Page opens...
#
# Ok, so now we first pull the list of buttons...
finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
# Now we can press the required button to open the page we want to get too...
finPageButton[[9]]$clickElement()
# We are now on the required page.
we are now on the target page [See image]
The next step is to extract the table values. To do this, we pull the .z-listitem
css-selector
data. Now we can check to confirm if we see the lines of data. We do, so we can now extract the values returned and populate either a list or Dataframe.
# Ok, now we need to extract the table, we identify and pull out the
# '.z-listitem' and assign to modalWindow
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
# Now we can extract the lines from modalWindow... Now that each line is
# returned as a single line of text, so we split into three based on the
# line marker "/n'
lineText <- str_split(modalWindow[[1]]$getElementText()[1], '\n')
lineText
here, is the result:
> lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
> lineText
[[1]]
[1] "10"
[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
[3] "0.00"
The Selenium WebDriver and thus RSelenium only interact with visible elements of a web page. If we try to read the entire table, we will only return table items that are visible (unhidden).
We can navigate this issue by scrolling to the bottom of the table. We force the table to populate due to the scroll action. We can then extract the complete table.
# Select the .z-listbox-body
modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")
# Now we tell the window we want to scroll to the bottom of the table
# This triggers the table to populate all the rows
modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")
# Now we can extract the complete table
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
lineText
###What the code does.
The code example above is meant to be self-contained. By that I mean it should install everything you need including required packages. Once the dependent R packages install, the R code will call checkForServer()
, if Selenium is not installed, the call will install it. This may take some time
My recommendation is you step through the code as I have not incorporated any delays (in production you would want to), note also I have not optimised for speed but rather for a modicum of clarity [from my perspective]...
The code was shown to work on:
Check out RSelenium
First, install RSelenium and use the above linked vignette to get familiar with the basics
Then see this webinar on using RSelenium, which goes through some detailed scraping step-by-step and is quite easy to follow: http://johndharrison.blogspot.hk/2014/05/orange-county-r-users-group-oc-rug.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With