I'm trying to scrape table data on a webpage using R (package rvest). To do that, the data needs to be in the HTML source file (that's where rvest apparently looks for it), but in this case it isn't.
However, data elements are shown in the Inspect panel's Elements view:
Source file shows an empty table:
Why is the data shown in Inspect Element but not in the source file? How can I access the table data in HTML format? If I can't access it through the HTML, how should I change my web scraping strategy?
The web page is https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST
Source file: view-source:https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST
EDIT: a solution using R would be appreciated.
Open Google Chrome and visit the page. Once there, right-click the table and select "Inspect element"; the devtools will open and you'll see the table's code. Right-click its opening tag and select "Copy as HTML".
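If a one-off extraction is all you need, the copied markup can then be parsed locally with rvest. A minimal sketch; the table contents below are placeholders, not the real page's data:

```r
library(rvest)

# Markup copied from devtools, pasted into a string (a saved .html file
# passed to read_html() works the same way); these cells are placeholders.
copied <- '<table>
  <tr><th>Mes</th><th>IPC</th></tr>
  <tr><td>Ene 2018</td><td>0.5</td></tr>
</table>'

tbl <- read_html(copied) |>
  html_element("table") |>
  html_table()

tbl
```

Note this only works for a one-time grab; the table on the live page is filled in by JavaScript, so the copied HTML goes stale as soon as the data updates.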
Let's select an element in the web table and find its XPath. In Chrome, right-click the element and choose "Inspect" to locate it in the Elements panel, then copy its XPath from there. In Firefox, right-click the desired element and go to "Inspect Element" to open the inspector, which likewise helps identify its XPath.
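That XPath can then go straight into rvest's xpath argument. A small sketch with made-up markup and a made-up XPath (the real ones come from your devtools):

```r
library(rvest)

# Toy document standing in for the real page
doc <- read_html('<div><table id="grid"><tr><td>0.5</td></tr></table></div>')

# Paste the XPath copied from the inspector here
cell <- html_element(doc, xpath = '//*[@id="grid"]//td')
html_text(cell)
```

Keep in mind that this still only sees what's in the downloaded HTML; if the table is filled in by JavaScript, the XPath will match nothing server-side.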
I really wish 'experts' would stop with the "you need Selenium/Headless Chrome" advice, since it's almost never true and introduces a needless, heavyweight third-party dependency into data science workflows.
The site is an ASP.NET site, so it makes heavy use of sessions, and the programmers behind this particular one force that session to start at the home page ("Hello, 2000 called and would like their session-state preservation model back.")
Anyway, we need to start there and progress to your page. Here's what that looks like to your browser:
We can also see from 👆🏽 that the site returns lovely JSON, so we'll eventually grab that. Let's start modeling an R httr workflow like the session above:
library(xml2)
library(httr)
library(rvest)
Start at the, um, start!
httr::GET(
url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
httr::verbose()
) -> res
Now, we need to get the HTML from that page, since there are a number of hidden values we must supply in the POST that follows; that's part of how the brain-dead ASP.NET workflow works (again, follow the requests in the image above):
pg <- httr::content(res)
hinput <- html_nodes(pg, "input")
hinput <- as.list(setNames(html_attr(hinput, "value"), html_attr(hinput, "name")))
hinput$`header$txtBoxBuscador` <- ""
hinput$`__EVENTARGUMENT` <- ""
hinput$`__EVENTTARGET` <- "lnkBut01"
httr::POST(
url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
httr::add_headers(
`Referer` = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx"
),
encode = "form",
body = hinput
) -> res
Now we've done what we need to con the website into thinking we've got a proper session so let's make the request for the JSON content:
httr::GET(
url = "https://si3.bcentral.cl/siete/secure/cuadros/actions.aspx",
httr::add_headers(
`X-Requested-With` = "XMLHttpRequest"
),
query = list(
Opcion = "1",
idMenu = "IPC_VAR_MEN1_HIST",
codCuadro = "IPC_VAR_MEN1_HIST",
DrDwnAnioDesde = "",
DrDwnAnioHasta = "",
DrDwnAnioDiario = "",
DropDownListFrequency = "",
DrDwnCalculo = "NONE"
)
) -> res
And, boom:
str(
httr::content(res), 1
)
## List of 32
## $ CodigoCuadro : chr "IPC_VAR_MEN1_HIST"
## $ Language : chr "es-CL"
## $ DescripcionCuadro : chr "IPC, IPCX, IPCX1 e IPC SAE, variación mensual, información histórica"
## $ AnioDesde : int 1928
## $ AnioHasta : int 2018
## $ FechaInicio : chr "01-01-2010"
## $ FechaFin : chr "01-11-2018"
## $ ListaFrecuencia :List of 1
## $ FrecuenciaDefecto : NULL
## $ DrDwnAnioDesde :List of 3
## $ DrDwnAnioHasta :List of 3
## $ DrDwnAnioDiario :List of 3
## $ hsDecimales :List of 1
## $ ListaCalculo :List of 1
## $ Metadatos : chr " <img runat=\"server\" ID=\"imgButMetaDatos\" alt=\"Ver metadatos\" src=\"../../Images/lens.gif\" OnClick=\"jav"| __truncated__
## $ NotasPrincipales : chr ""
## $ StatusTextBox : chr ""
## $ Grid :List of 4
## $ GridColumnNames :List of 113
## $ Paginador : int 15
## $ allowEmptyColumns : logi FALSE
## $ FechaInicioSelected: chr "2010"
## $ FechaFinSelected : chr "2018"
## $ FrecuenciaSelected : chr "MONTHLY"
## $ CalculoSelected : chr "NONE"
## $ AnioDiarioSelected : chr "2010"
## $ UrlFechaBase : chr "Indizar_fechaBase.aspx?codCuadro=IPC_VAR_MEN1_HIST"
## $ FechaBaseCuadro : chr "Ene 2010"
## $ IsBoletin : logi FALSE
## $ CheckSelected :List of 4
## $ lnkButFechaBase : logi FALSE
## $ ShowFechaBase : logi FALSE
Dig around in the JSON for the data you need. I think it's in the Grid elements.
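As a rough sketch of turning that into a data frame: the shape of the JSON below, including the "Rows" field inside Grid, is an assumption for illustration only; run str(httr::content(res)$Grid) to see the real structure before adapting this.

```r
library(jsonlite)

# Stand-in for the real response; the "Rows" field and the row shape are
# hypothetical -- inspect str(content(res)$Grid) for the actual names.
js <- fromJSON('{
  "GridColumnNames": ["Serie", "Ene 2018", "Feb 2018"],
  "Grid": {"Rows": [{"Serie": "IPC", "Ene 2018": 0.5, "Feb 2018": 0.0}]}
}', simplifyVector = FALSE)

col_names <- unlist(js$GridColumnNames)
rows <- lapply(js$Grid$Rows, function(r) as.data.frame(r[col_names], check.names = FALSE))
df <- do.call(rbind, rows)
df
```

The same pattern (pull the column names out of GridColumnNames, then rbind one data-frame row per Grid entry) should carry over once you've confirmed the actual field names.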
The data is more than likely loaded dynamically from a data source or API. You can scrape the filled table by finding that underlying endpoint in the browser's network tab, sending a GET request to it directly, and parsing the response.