 

HTML table does not show on source file

I'm trying to scrape table data on a webpage using R (package rvest). To do that, the data needs to be in the html source file (that's where rvest looks for it apparently), but in this case it isn't.

However, data elements are shown in the Inspect panel's Elements view:

The table's elements are shown in the Elements view of the Inspect panel

Source file shows an empty table:

view-source file shows an empty table

Why is the data shown in Inspect Element but not in the source file? How can I access the table data in HTML format? If I can't access it through HTML, how should I change my web-scraping strategy?

The web page is https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST

Source file: view-source:https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST


EDIT: a solution using R is appreciated

asked Dec 08 '18 by David Jorquera




2 Answers

I really wish 'experts' would stop with the "you need Selenium/Headless Chrome" advice, since it's almost never true and it introduces a needless, heavyweight third-party dependency into data science workflows.

The site is an ASP.NET site, so it makes heavy use of sessions, and the programmers behind this particular one force that session to start at the home page. ("Hello, 2000 called and would like their session-state preservation model back.")

Anyway, we need to start there and progress to your page. Here's what that looks like to your browser:

(screenshot: the DevTools Network panel showing the sequence of requests the browser makes)

We can also see from the requests above that the site returns lovely JSON, so that's what we'll eventually grab. Let's start modeling the above session as an R httr workflow:

library(xml2)
library(httr)
library(rvest)

Start at the, um, start!

httr::GET(
  url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
  httr::verbose()
) -> res

Now, we need to get the HTML from that page as there are a number of hidden values we need to supply to the POST that is made since that's part of how the brain-dead ASP.NET workflow works (again, follow the requests in the image above):

pg <- httr::content(res)

hinput <- html_nodes(pg, "input")
hinput <- as.list(setNames(html_attr(hinput, "value"), html_attr(hinput, "name")))
hinput$`header$txtBoxBuscador` <- ""
hinput$`__EVENTARGUMENT` <- ""
hinput$`__EVENTTARGET` <- "lnkBut01"

httr::POST(
  url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
  httr::add_headers(
    `Referer` = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx"
  ),
  encode = "form",
  body = hinput
) -> res

Now we've done what we need to con the website into thinking we've got a proper session so let's make the request for the JSON content:

httr::GET(
  url = "https://si3.bcentral.cl/siete/secure/cuadros/actions.aspx",
  httr::add_headers(
    `X-Requested-With` = "XMLHttpRequest"
  ),
  query = list(
    Opcion = "1",
    idMenu = "IPC_VAR_MEN1_HIST",
    codCuadro = "IPC_VAR_MEN1_HIST",
    DrDwnAnioDesde = "",
    DrDwnAnioHasta = "",
    DrDwnAnioDiario = "",
    DropDownListFrequency = "",
    DrDwnCalculo = "NONE"
  )
) -> res

And, boom:

str(
  httr::content(res), 1
)

## List of 32
##  $ CodigoCuadro       : chr "IPC_VAR_MEN1_HIST"
##  $ Language           : chr "es-CL"
##  $ DescripcionCuadro  : chr "IPC, IPCX, IPCX1 e IPC SAE, variación mensual, información histórica"
##  $ AnioDesde          : int 1928
##  $ AnioHasta          : int 2018
##  $ FechaInicio        : chr "01-01-2010"
##  $ FechaFin           : chr "01-11-2018"
##  $ ListaFrecuencia    :List of 1
##  $ FrecuenciaDefecto  : NULL
##  $ DrDwnAnioDesde     :List of 3
##  $ DrDwnAnioHasta     :List of 3
##  $ DrDwnAnioDiario    :List of 3
##  $ hsDecimales        :List of 1
##  $ ListaCalculo       :List of 1
##  $ Metadatos          : chr " <img runat=\"server\" ID=\"imgButMetaDatos\" alt=\"Ver metadatos\" src=\"../../Images/lens.gif\" OnClick=\"jav"| __truncated__
##  $ NotasPrincipales   : chr ""
##  $ StatusTextBox      : chr ""
##  $ Grid               :List of 4
##  $ GridColumnNames    :List of 113
##  $ Paginador          : int 15
##  $ allowEmptyColumns  : logi FALSE
##  $ FechaInicioSelected: chr "2010"
##  $ FechaFinSelected   : chr "2018"
##  $ FrecuenciaSelected : chr "MONTHLY"
##  $ CalculoSelected    : chr "NONE"
##  $ AnioDiarioSelected : chr "2010"
##  $ UrlFechaBase       : chr "Indizar_fechaBase.aspx?codCuadro=IPC_VAR_MEN1_HIST"
##  $ FechaBaseCuadro    : chr "Ene 2010"
##  $ IsBoletin          : logi FALSE
##  $ CheckSelected      :List of 4
##  $ lnkButFechaBase    : logi FALSE
##  $ ShowFechaBase      : logi FALSE

Dig around in the JSON for the data you need. I think it's in the Grid… elements.
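As a starting point for that digging, here is a minimal sketch in R. It only inspects fields that appear in the str() output above; the internal layout of Grid is not shown there, so check it yourself before relying on any field names:

```r
library(httr)

# Continuing from the `res` object returned by the final GET above.
dat <- httr::content(res)

# Inspect the components that look like they hold the table data:
str(dat$Grid, 2)             # the grid itself (a list of 4 components)
str(dat$GridColumnNames, 1)  # the 113 column labels for the table
```

From there, the usual pattern is to flatten the relevant list (e.g. with purrr or do.call(rbind, ...)) into a data frame once you know which of the four Grid components holds the rows.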

answered Oct 08 '22 by hrbrmstr


The data is more than likely loaded dynamically from a data source or API after the initial page load, which is why it appears in the Elements panel but not in the page source. Use the browser DevTools Network tab to find the request that fetches the data, then send a GET request to that endpoint directly, or scrape the page only after the data has finished loading.

answered Oct 08 '22 by Brady Ward