Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is this the solution to get data from EUROSTAT into R?

Tags:

r

I often use data from EUROSTAT and found it quite annoying that the data could not be load straight forwardly into R. I have written this snippet to get any dataset provided by the bulk download facility from EUROSTAT http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&dir=dic%2Fen

Is there a better way? ..this one worked for me:

    #this library is used to download data from eurostat and to find datasets
#later extend to extend to find datasets with certain dimensions

#download data from eurostat
#unpack and convert to dataframe
#load label descriptions
#load factors
#save as r data object

datasetname="ebd_all"

LANGUAGE="en"

install.packages("RCurl")
library(RCurl)
library(data.table)
library(reshape)
library(stringr)

baseurl="http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=data%2F"

fullfilename=paste(datasetname,".tsv.gz",sep="")
temp <- paste(tempfile(),".gz",sep="")
download.file(paste(baseurl,fullfilename,sep=""),temp)
dataconnection <- gzfile(temp)
d=read.delim(dataconnection)
longdata=melt(d,id=colnames(d)[1])

firstname=colnames(d)[1] # remove .time and count how many headings are there 
firstname=substr(firstname,1,nchar(firstname)-nchar(".time"))
headings=toupper(strsplit(firstname,".",fixed=TRUE)[[1]])
headingcount=length(headings)
colnames(longdata)=c("dimensions","time","value")


#get the data on the dimension tables
df=data.frame(dimensions=as.character(longdata[,"dimensions"]))
df = transform(df, dimensions= colsplit(dimensions, split = "\\,",names=headings))
dimensions=data.table(df$dimensions)

#download the dimension labels - save headings as better variable
dimfile=paste("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2Fdimlst.dic",sep="")

temp <- paste(tempfile(),".gz",sep="")
download.file(dimfile,temp)
dataconnection <- gzfile(temp)
dimdata=read.delim(dataconnection,header=FALSE)
colnames(dimdata)=c("colname","desc")
lab=dimdata$desc
names(lab)=dimdata$colname

#create  headings that speak for themselves for columns
speakingheadings=as.character(lab[headings])

#download factors for each heading and add
for(heading in headings){
  factorfile=paste("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2F",tolower(heading),".dic",sep="")
  temp <- paste(tempfile(),".gz",sep="")
  download.file(factorfile,temp)
  dataconnection <- gzfile(temp)
  factordata=read.delim(dataconnection,header=FALSE)
  colnames(factordata)=c(heading,paste(heading,"_desc",sep=""))
  #join the heading to the heading dataset
  dimensions=merge(dimensions,factordata,by=heading,all.x=TRUE)
}


  #at the end at speaking headings
setnames(dimensions,colnames(dimensions)[1:length(speakingheadings)],speakingheadings)

  #add data columns by writing and reading again---FASTER ;-)
temp=tempfile()
values=data.frame(value=as.character(longdata$value))
values = transform(values, value= colsplit(value, split = "\\ ",names=c("value","flag")))
values=values$value
values=data.table(values)

values$value=as.character(values$value)
values$flag=as.character(values$flag)
values[value==flag,flag:=NA]
values$value=as.double(values$value)

eurostatdata=cbind(dimensions,time=longdata$time,values) 
save(eurostatdata,file=paste(datasetname,".RData"))
like image 981
Johannes Kutsam Avatar asked Oct 06 '12 18:10

Johannes Kutsam


3 Answers

Check out SmarterPoland package, there are functions to download (and get into R) data directly from EUROSTAT.

here is example:

library(SmarterPoland)
# info about passagers
grepEurostatTOC("split of passenger transport")
## get table
tmp <- getEurostatRCV("tsdtr210")
summary(tmp)

##     vehicle         geo            time          value     
##  BUS_TOT:756   AT     :  63   1990   : 108   Min.   : 0.0  
##  CAR    :756   BE     :  63   1991   : 108   1st Qu.: 6.9  
##  TRN    :756   BG     :  63   1992   : 108   Median :12.9  
##                CH     :  63   1993   : 108   Mean   :33.6  
##                CY     :  63   1994   : 108   3rd Qu.:77.4  
##                CZ     :  63   1995   : 108   Max.   :93.4  
##                (Other):1890   (Other):1620   NA's   :397

source: www.smarterpoland.pl

like image 67
Maciej Avatar answered Nov 04 '22 13:11

Maciej


Eurostat disseminates their statistical data in SDMX format. In R, you can use the rsdmx package to read their data. See below example:

#in case you want to install rsdmx from Github
#(otherwise you can install it from CRAN)
require(devtools)
install_github("rsdmx", "opensdmx")
require(rsdmx)

#read EUROSTAT dataset
dataURL <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011 "
sdmx <- readSDMX(dataURL)
stats <- as.data.frame(sdmx)
head(stats)

Note: You can find rsdmx either from CRAN or install it directly from GitHub repository. https://github.com/opensdmx/rsdmx

rsdmx is not limited to reading SDMX datasets but all SDMX documents, data or metadata, including data structures and codelists. I invite you to check the rsdmx wiki if you want more examples.

like image 38
eblondel Avatar answered Nov 04 '22 12:11

eblondel


Revised Answer

The RJSDMX package can also be used to retrieve data from Eurostat into R. An example is shown below.

library(RJSDMX)
data <- getTimeSeries("EUROSTAT","nama_gdp_c/.EUR_HAB.B1GM.DE")

Initial Answer (left here for reference)

A better way might be to use the Eurostat Web Service facility which offers programmatic access to Eurostat data. The Web Service facility is another service - in addition to the Bulk Download facility - provided by Eurostat for extracting data from their database. To use the service you'll need to send a REST or SOAP request. The data retrieved from the server is a hierarchical data structure - a type of XML document, which can be parsed using the XML package.

Before checking out the quick example below, I recommended reading some of the information about the service provided by Eurotat, which can be found here: http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/a_few_useful_points

# Step 0: Load the XML package.
# This is used later on to parse the XML retrieved from Eurostat.
# For a tutorial on XML and parsing XML documents, read this: http://www.w3schools.com/xpath/default.asp

library(XML)

# Step 1: Construct the appropriate REST query.
# First read this:     http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/a_few_useful_points

# Specify the data to be retrieved.

resource    <- "data"
dataflow    <- "nama_gdp_c"
key         <- ".EUR_HAB.B1GM.DE"
time_filter <- "?startPeriod=2010"

# Construct the query

partial_url <- paste(paste(resource, dataflow, key, sep="/"), time_filter, sep="")
base_url    <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/"
rest_query  <- paste(base_url, partial_url, sep="")

# Step 2: Make the request using cURL (that is, retrieve the data)
# For information about cURL, read this: http://curl.haxx.se/
# For information about the curl command, check out the man pages: http://curl.haxx.se/docs/manpage.html

command   <- paste("curl", rest_query)
raw_data  <- system(command, intern=TRUE)

# Note: at this stage, the data is a character object.

class(raw_data)

# View the data, which can be found commented out at the bottom of this script. Note that it is a hierarchical data structure. 

# Step 3: Parse the data
# Here we use functions from the XML package - one could, of course, use base package functions, but why?

data <- xmlParse(raw_data)

# Parsing the data returns an object of class: "XMLInternalDocument" and "XMLAbstractDocument"

class(data)

# Step 4: Extract the numerical data 

# Data can be found using getNodeSet(), but the data remains stuck between "tags" - we just want the numbers.

getNodeSet(data,"//generic:ObsValue")

# The numbers we want to extract are, in this case, value "attributes". We can target these values as follows.

xpathApply(data, "//generic:ObsValue", xmlGetAttr, name="value")

# That does the job, but really we'd like those numbers in a vector rather than a list object.

numbers <- as.numeric(xpathApply(data, "//generic:ObsValue", xmlGetAttr, name="value"))

# Step 5: Extract the dates (years) - i.e. get some metadata.
# This is similar to above.

years <- as.numeric(xpathApply(data, "//generic:ObsDimension", xmlGetAttr, name="value"))

# Step 6 and on-wards:
#
# Enter your code here... =)
#
#
#

For reference, the raw data looks like this:

<?xml version="1.0" encoding="utf-8"?>
<message:GenericData xmlns:footer="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message/footer" xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic" xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common" xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <message:Header>
    <message:ID>13e08b8d24936d75b3a6fa1b9c632e22</message:ID>
    <message:Test>false</message:Test>
    <message:Prepared>2014-10-20T21:10:50</message:Prepared>
    <message:Sender id="ESTAT">
      <common:Name xml:lang="en">Eurostat</common:Name>
      <message:Timezone>+01:00</message:Timezone>
    </message:Sender>
    <message:Receiver id="RECEIVER"/>
    <message:Structure structureID="ESTAT_DSD_nama_gdp_c_1_0" dimensionAtObservation="TIME_PERIOD">
      <common:Structure>
        <Ref agencyID="ESTAT" id="DSD_nama_gdp_c" version="1.0"/>
      </common:Structure>
    </message:Structure>
    <message:DataSetAction>Append</message:DataSetAction>
    <message:DataSetID>nama_gdp_c</message:DataSetID>
  </message:Header>
  <message:DataSet structureRef="ESTAT_DSD_nama_gdp_c_1_0">
    <generic:Series>
      <generic:SeriesKey>
        <generic:Value id="UNIT" value="EUR_HAB"/>
        <generic:Value id="INDIC_NA" value="B1GM"/>
        <generic:Value id="GEO" value="DE"/>
        <generic:Value id="FREQ" value="A"/>
      </generic:SeriesKey>
      <generic:Obs>
        <generic:ObsDimension value="2013"/>
        <generic:ObsValue value="33300.0"/>
      </generic:Obs>
      <generic:Obs>
        <generic:ObsDimension value="2012"/>
        <generic:ObsValue value="32600.0"/>
      </generic:Obs>
      <generic:Obs>
        <generic:ObsDimension value="2011"/>
        <generic:ObsValue value="31900.0"/>
      </generic:Obs>
      <generic:Obs>
        <generic:ObsDimension value="2010"/>
        <generic:ObsValue value="30500.0"/>
      </generic:Obs>
    </generic:Series>
  </message:DataSet>
</message:GenericData>

Points to note: I've used cURL to send the request, but this could have been done in many other ways, e.g. using Wget, Perl, PHP, etc. As long as you're willing (and able) to use the system() command, getting data into R from Eurostat in programmatic fashion should be straightforward enough (see edit below). Munging the data into, say, a ts object (or mts object depending on the query you send) should be straightforward enough, too. Lastly, I'm using a Linux operating system (Ubuntu distribution), so if you're on Windows, the example above may not work for you.

I hope this helps!

Edit: I just noticed you loaded the RCurl package, so if you want you could replace whatever facility that package offers with the system command that I used.

like image 2
Graeme Walsh Avatar answered Nov 04 '22 12:11

Graeme Walsh