
Download economic official data from a Central Bank web page

Tags: dataframe, r

I've been searching for an answer to my question for a while. I read this, this and this, and some other related posts, but I still have no answer.

My problem is quite simple (I hope), but the answer is not (at least for me). I want to import some economic data from this web page, a monthly indicator of Nicaraguan economic activity. So far I've tried this:

library(XML)
u <- "http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm"
u <- htmlParse(u,encoding="UTF-8")
imae <- readHTMLTable(doc=u, header=T)
imae

library(httr)
u2 <- "http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm"
page <- GET(u2, user_agent("httr"))
x <- readHTMLTable(text_content(page), as.data.frame=TRUE)

Neither attempt worked. The first chunk of code gave me this output:

   $`NULL`
                                  BANCO CENTRAL DE NICARAGUA    NA    NA    NA   NA   NA   NA   NA   NA   NA   NA    NA    NA       NA
1                                                             <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>
2 <U+633C><U+3E64>ndice Mensual de Actividad Económica(IMAE)  <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>
3                                           (Base: 1994=100)  <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>
4                                                             <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>
5                                                        Año   Ene   Feb   Mar  Abr  May  Jun  Jul  Ago  Sep  Oct   Nov   Dic Promedio
6                                                             <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>
7                                                       1994 101.6 107.6 100.1 95.7 94.7 92.8 92.1 96.8 98.5 97.4 101.7 121.1    100.0
8                                               Fuente: BCN.  <NA>  <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA>  <NA>     <NA>

I tried using skip.rows=1:5, but it doesn't really change the main result, which still has too many NA values. Can anybody shed some light on this?

The expected result is a data.frame with the information shown on that web page.

asked Oct 05 '12 19:10 by Jilber Urbina




2 Answers

As I mentioned in my comment, the problem most likely arises from the poorly coded HTML table.

You can try something like the following (tested on Ubuntu using RStudio). It requires wget and HTML Tidy to be installed. If you don't want to install these programs, jump to the update at the end of this answer.

  1. Download the page and "tidy" it up.

    system("wget http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm")
    system("tidy 4.IMAE.htm > new.html")
    
  2. Proceed with R as you normally would

    library(XML)
    u <- htmlParse("new.html")
    imae <- readHTMLTable(u)
    
  3. If we view the output of readHTMLTable above, we see that we need to skip a few rows. Let's run it again:

    imae <- readHTMLTable(u, skip.rows=c(1:5, 7, 27, 28), header=TRUE)
    imae
    # $`NULL`
    #     Año   Ene   Feb   Mar   Abr   May   Jun   Jul   Ago   Sep   Oct   Nov   Dic Promedio
    # 1  1994 101.6 107.6 100.1  95.7  94.7  92.8  92.1  96.8  98.5  97.4 101.7 121.1    100.0
    # 2  1995 113.2 105.0 113.6  98.0 100.9  95.4  99.8 101.5 108.3 107.1 107.6 133.2    107.0
    # 3  1996 123.6 116.0 109.1 107.3  94.8 101.2 100.7 115.3 110.6 112.7 117.5 137.7    112.2
    # 4  1997 133.4 115.9 117.4 118.8 120.4 108.2 107.4 111.1 120.3 117.7 119.5 142.3    119.4
    # 5  1998 131.4 120.4 127.9 118.4 130.2 116.5 122.1 129.7 127.3 127.5 112.7 156.6    126.7
    # 6  1999 146.0 139.6 146.9 134.8 140.6 131.8 130.6 128.3 128.9 131.8 142.7 172.6    139.5
    # 7  2000 157.8 142.1 147.3 138.5 137.7 135.7 128.9 131.2 141.7 143.0 156.6 191.2    146.0
    # 8  2001 163.3 143.8 154.8 141.5 147.6 134.0 135.7 143.3 138.2 138.8 145.3 187.3    147.8
    # 9  2002 152.1 144.7 143.3 142.1 143.1 131.9 136.1 145.7 146.4 147.8 157.5 185.0    148.0
    # 10 2003 159.3 151.4 149.1 142.7 139.7 139.1 145.6 147.8 154.9 158.4 157.8 195.7    153.5
    # 11 2004 172.8 157.1 166.9 153.6 161.2 150.5 155.3 153.3 156.6 155.6 167.7 213.0    163.6
    # 12 2005 183.1 170.6 173.6 158.7 160.8 158.5 158.8 168.7 165.8 165.4 178.4 218.8    171.8
    # 13 2006 187.7 177.8 185.6 161.8 166.4 163.2 164.7 175.1 175.1 185.3 189.6 231.2    180.3
    # 14 2007 200.1 184.1 196.5 180.1 169.7 171.4 181.6 180.9 173.0 182.8 202.0 236.7    188.2
    # 15 2008 205.4 194.4 193.1 205.9 171.0 174.8 181.3 190.7 183.1 182.7 182.5 244.7    192.5
    # 16 2009 195.7 191.0 190.8 177.0 168.1 172.6 179.2 185.6 178.9 181.4 191.3 241.4    187.7
    # 17 2010 195.2 193.7 205.1 185.2 179.3 190.1 191.6 190.0 193.5 197.6 210.9 266.0    199.8
    # 18 2011 213.9 207.4 217.3 198.7 196.1 198.8 191.9 210.0 203.7 207.9 217.3 274.5    211.5
    # 19 2012 233.6                                                                      233.6
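
The columns of a scraped table often arrive as text or factors rather than numbers, so a conversion step is probably needed before any analysis. The following is a minimal sketch under stated assumptions: `imae_df` is a hypothetical toy stand-in for `imae[[1]]` (three rows copied from the output above, with the year column renamed to `Year` to keep the example ASCII-only), and `to_numeric` is an illustrative helper, not part of the XML package.

```r
# Toy stand-in for imae[[1]]: three rows taken from the output above,
# with every column stored as text (an assumption about what
# readHTMLTable returns for this page).
imae_df <- data.frame(
  Year     = c("1994", "1995", "2012"),
  Ene      = c("101.6", "113.2", "233.6"),
  Feb      = c("107.6", "105.0", ""),
  Promedio = c("100.0", "107.0", "233.6"),
  stringsAsFactors = FALSE
)

# Convert one column: go through as.character() so factors are handled
# too, turn empty cells into NA, then coerce to numeric.
to_numeric <- function(col) {
  col <- trimws(as.character(col))
  col[which(col == "")] <- NA
  as.numeric(col)
}

# Apply the conversion to every column and rebuild the data frame.
imae_num <- as.data.frame(lapply(imae_df, to_numeric))
str(imae_num)
```

The empty cells of the incomplete 2012 row become NA, which keeps functions such as mean(..., na.rm = TRUE) usable.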
    

Update: A little function to help out

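If you can live with having to do some text cleanup for the accented characters, the W3C offers an online implementation of HTML Tidy. This allows you to write a basic function like the following:

tidyHTML <- function(URL) {
  require(XML)
  # Percent-encode the slashes and colons so the address can be passed
  # as a query parameter
  URL <- gsub("/", "%2F", URL)
  URL <- gsub(":", "%3A", URL)
  # Ask the W3C tidy service to fetch and clean the page
  URL <- paste("http://services.w3.org/tidy/tidy?docAddr=", URL, sep = "")
  # Parse the cleaned HTML returned by the service
  htmlParse(URL)
}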

Usage is simple:

u <- tidyHTML("http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm")
readHTMLTable(u)
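
The two gsub calls in tidyHTML only cover ":" and "/". A possible alternative, sketched here as an assumption rather than as part of the original answer, is to let utils::URLencode handle the encoding so that other reserved characters (for example "?" or "&" in a different page address) are covered too. The helper name `tidy_url` is hypothetical, and the sketch only builds the request address; it does not contact the service.

```r
# Build the request address for the W3C tidy service named in the answer.
# Whether that service is still reachable is not checked here.
tidy_url <- function(URL) {
  # reserved = TRUE also encodes ":" and "/", like the gsub calls above
  encoded <- utils::URLencode(URL, reserved = TRUE)
  paste0("http://services.w3.org/tidy/tidy?docAddr=", encoded)
}

page <- "http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm"
request <- tidy_url(page)
request
```

The resulting string could then be handed to htmlParse in the same way as in tidyHTML.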
answered Oct 19 '22 17:10 by A5C1D2H2I1M1N2O1R2T1


This is a bit of a hack that sort of works when the table is not structured well enough for the approaches in the other responses you linked to. It is really a one-off that works only as long as the page format doesn't change, so beware: it can be risky. There are likely more general solutions that folks can add.

require(RCurl)
require(XML)
u <- "http://www.bcn.gob.ni/estadisticas/trimestrales_y_mensuales/siec/datos/4.IMAE.htm"

webpage <- getURL(u)
lines <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(lines, error=function(...){}, useInternalNodes = TRUE)

# parse tree by any tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)  

# remove white space and such w/ regexes
unlisted <- unlist(strsplit(x, "\n"))
notabs <- gsub("\t","",unlisted)
nowhitespace <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", notabs, perl=TRUE)
data <- nowhitespace[!(nowhitespace %in% c("", "|"))]

Here comes the dodgy part:

months <- data[5:16]
data_out <- data[18:(length(data) - 4)]  # omits 2012 data to easily fit structure argument

finalhack <- data.frame(t(structure(
  data_out,
  dim = c(14, 18),
  .Dimnames = list(c('year', months, 'index'), seq(1994, 2011))
)))
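
The structure-and-transpose step can be hard to read. One possibly clearer way to express the same reshaping, offered as a sketch and not as the answer's own method, is to fill a matrix row by row. Here `months` and `data_out` are toy stand-ins holding two years copied from the table in the other answer, and `imae_wide` is a hypothetical name.

```r
# Toy stand-in for `months` and `data_out`: two years flattened as
# year, 12 monthly values, yearly average.
months <- c("Ene", "Feb", "Mar", "Abr", "May", "Jun",
            "Jul", "Ago", "Sep", "Oct", "Nov", "Dic")
data_out <- c(
  "1994", "101.6", "107.6", "100.1", "95.7", "94.7", "92.8",
  "92.1", "96.8", "98.5", "97.4", "101.7", "121.1", "100.0",
  "1995", "113.2", "105.0", "113.6", "98.0", "100.9", "95.4",
  "99.8", "101.5", "108.3", "107.1", "107.6", "133.2", "107.0"
)

# Fail early if the flat vector is not a whole number of 14-value rows.
stopifnot(length(data_out) %% 14 == 0)

# Fill row by row so each year becomes one row, converting to numbers.
m <- matrix(as.numeric(data_out), ncol = 14, byrow = TRUE,
            dimnames = list(NULL, c("year", months, "index")))
imae_wide <- as.data.frame(m)
imae_wide
```

Because the number of rows is derived from the length of the vector, this version does not hard-code the 1994 to 2011 range, although it still depends on the page keeping 14 values per year.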
answered Oct 19 '22 17:10 by ako