I often use data from EUROSTAT and found it quite annoying that the data could not be load straight forwardly into R. I have written this snippet to get any dataset provided by the bulk download facility from EUROSTAT http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&dir=dic%2Fen
Is there a better way? ..this one worked for me:
#this library is used to download data from eurostat and to find datasets
#later extend to extend to find datasets with certain dimensions
#download data from eurostat
#unpack and convert to dataframe
#load label descriptions
#load factors
#save as r data object
datasetname="ebd_all"
LANGUAGE="en"
install.packages("RCurl")
library(RCurl)
library(data.table)
library(reshape)
library(stringr)
baseurl="http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=data%2F"
fullfilename=paste(datasetname,".tsv.gz",sep="")
temp <- paste(tempfile(),".gz",sep="")
download.file(paste(baseurl,fullfilename,sep=""),temp)
dataconnection <- gzfile(temp)
d=read.delim(dataconnection)
longdata=melt(d,id=colnames(d)[1])
firstname=colnames(d)[1] # remove .time and count how many headings are there
firstname=substr(firstname,1,nchar(firstname)-nchar(".time"))
headings=toupper(strsplit(firstname,".",fixed=TRUE)[[1]])
headingcount=length(headings)
colnames(longdata)=c("dimensions","time","value")
#get the data on the dimension tables
df=data.frame(dimensions=as.character(longdata[,"dimensions"]))
df = transform(df, dimensions= colsplit(dimensions, split = "\\,",names=headings))
dimensions=data.table(df$dimensions)
#download the dimension labels - save headings as better variable
dimfile=paste("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2Fdimlst.dic",sep="")
temp <- paste(tempfile(),".gz",sep="")
download.file(dimfile,temp)
dataconnection <- gzfile(temp)
dimdata=read.delim(dataconnection,header=FALSE)
colnames(dimdata)=c("colname","desc")
lab=dimdata$desc
names(lab)=dimdata$colname
#create headings that speak for themselves for columns
speakingheadings=as.character(lab[headings])
#download factors for each heading and add
for(heading in headings){
factorfile=paste("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2F",tolower(heading),".dic",sep="")
temp <- paste(tempfile(),".gz",sep="")
download.file(factorfile,temp)
dataconnection <- gzfile(temp)
factordata=read.delim(dataconnection,header=FALSE)
colnames(factordata)=c(heading,paste(heading,"_desc",sep=""))
#join the heading to the heading dataset
dimensions=merge(dimensions,factordata,by=heading,all.x=TRUE)
}
#at the end at speaking headings
setnames(dimensions,colnames(dimensions)[1:length(speakingheadings)],speakingheadings)
#add data columns by writing and reading again---FASTER ;-)
temp=tempfile()
values=data.frame(value=as.character(longdata$value))
values = transform(values, value= colsplit(value, split = "\\ ",names=c("value","flag")))
values=values$value
values=data.table(values)
values$value=as.character(values$value)
values$flag=as.character(values$flag)
values[value==flag,flag:=NA]
values$value=as.double(values$value)
eurostatdata=cbind(dimensions,time=longdata$time,values)
save(eurostatdata,file=paste(datasetname,".RData"))
Check out SmarterPoland package, there are functions to download (and get into R) data directly from EUROSTAT.
here is example:
library(SmarterPoland)
# info about passagers
grepEurostatTOC("split of passenger transport")
## get table
tmp <- getEurostatRCV("tsdtr210")
summary(tmp)
## vehicle geo time value
## BUS_TOT:756 AT : 63 1990 : 108 Min. : 0.0
## CAR :756 BE : 63 1991 : 108 1st Qu.: 6.9
## TRN :756 BG : 63 1992 : 108 Median :12.9
## CH : 63 1993 : 108 Mean :33.6
## CY : 63 1994 : 108 3rd Qu.:77.4
## CZ : 63 1995 : 108 Max. :93.4
## (Other):1890 (Other):1620 NA's :397
source: www.smarterpoland.pl
Eurostat disseminates their statistical data in SDMX format. In R, you can use the rsdmx package to read their data. See below example:
#in case you want to install rsdmx from Github
#(otherwise you can install it from CRAN)
require(devtools)
install_github("rsdmx", "opensdmx")
require(rsdmx)
#read EUROSTAT dataset
dataURL <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011 "
sdmx <- readSDMX(dataURL)
stats <- as.data.frame(sdmx)
head(stats)
Note: You can find rsdmx either from CRAN or install it directly from GitHub repository. https://github.com/opensdmx/rsdmx
rsdmx is not limited to reading SDMX datasets
but all SDMX documents, data or metadata, including data structures
and codelists
. I invite you to check the rsdmx wiki if you want more examples.
Revised Answer
The RJSDMX package can also be used to retrieve data from Eurostat into R. An example is shown below.
library(RJSDMX)
data <- getTimeSeries("EUROSTAT","nama_gdp_c/.EUR_HAB.B1GM.DE")
Initial Answer (left here for reference)
A better way might be to use the Eurostat Web Service facility which offers programmatic access to Eurostat data. The Web Service facility is another service - in addition to the Bulk Download facility - provided by Eurostat for extracting data from their database. To use the service you'll need to send a REST or SOAP request. The data retrieved from the server is a hierarchical data structure - a type of XML document, which can be parsed using the XML package.
Before checking out the quick example below, I recommended reading some of the information about the service provided by Eurotat, which can be found here: http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/a_few_useful_points
# Step 0: Load the XML package.
# This is used later on to parse the XML retrieved from Eurostat.
# For a tutorial on XML and parsing XML documents, read this: http://www.w3schools.com/xpath/default.asp
library(XML)
# Step 1: Construct the appropriate REST query.
# First read this: http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/a_few_useful_points
# Specify the data to be retrieved.
resource <- "data"
dataflow <- "nama_gdp_c"
key <- ".EUR_HAB.B1GM.DE"
time_filter <- "?startPeriod=2010"
# Construct the query
partial_url <- paste(paste(resource, dataflow, key, sep="/"), time_filter, sep="")
base_url <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/"
rest_query <- paste(base_url, partial_url, sep="")
# Step 2: Make the request using cURL (that is, retrieve the data)
# For information about cURL, read this: http://curl.haxx.se/
# For information about the curl command, check out the man pages: http://curl.haxx.se/docs/manpage.html
command <- paste("curl", rest_query)
raw_data <- system(command, intern=TRUE)
# Note: at this stage, the data is a character object.
class(raw_data)
# View the data, which can be found commented out at the bottom of this script. Note that it is a hierarchical data structure.
# Step 3: Parse the data
# Here we use functions from the XML package - one could, of course, use base package functions, but why?
data <- xmlParse(raw_data)
# Parsing the data returns an object of class: "XMLInternalDocument" and "XMLAbstractDocument"
class(data)
# Step 4: Extract the numerical data
# Data can be found using getNodeSet(), but the data remains stuck between "tags" - we just want the numbers.
getNodeSet(data,"//generic:ObsValue")
# The numbers we want to extract are, in this case, value "attributes". We can target these values as follows.
xpathApply(data, "//generic:ObsValue", xmlGetAttr, name="value")
# That does the job, but really we'd like those numbers in a vector rather than a list object.
numbers <- as.numeric(xpathApply(data, "//generic:ObsValue", xmlGetAttr, name="value"))
# Step 5: Extract the dates (years) - i.e. get some metadata.
# This is similar to above.
years <- as.numeric(xpathApply(data, "//generic:ObsDimension", xmlGetAttr, name="value"))
# Step 6 and on-wards:
#
# Enter your code here... =)
#
#
#
For reference, the raw data looks like this:
<?xml version="1.0" encoding="utf-8"?>
<message:GenericData xmlns:footer="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message/footer" xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic" xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common" xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<message:Header>
<message:ID>13e08b8d24936d75b3a6fa1b9c632e22</message:ID>
<message:Test>false</message:Test>
<message:Prepared>2014-10-20T21:10:50</message:Prepared>
<message:Sender id="ESTAT">
<common:Name xml:lang="en">Eurostat</common:Name>
<message:Timezone>+01:00</message:Timezone>
</message:Sender>
<message:Receiver id="RECEIVER"/>
<message:Structure structureID="ESTAT_DSD_nama_gdp_c_1_0" dimensionAtObservation="TIME_PERIOD">
<common:Structure>
<Ref agencyID="ESTAT" id="DSD_nama_gdp_c" version="1.0"/>
</common:Structure>
</message:Structure>
<message:DataSetAction>Append</message:DataSetAction>
<message:DataSetID>nama_gdp_c</message:DataSetID>
</message:Header>
<message:DataSet structureRef="ESTAT_DSD_nama_gdp_c_1_0">
<generic:Series>
<generic:SeriesKey>
<generic:Value id="UNIT" value="EUR_HAB"/>
<generic:Value id="INDIC_NA" value="B1GM"/>
<generic:Value id="GEO" value="DE"/>
<generic:Value id="FREQ" value="A"/>
</generic:SeriesKey>
<generic:Obs>
<generic:ObsDimension value="2013"/>
<generic:ObsValue value="33300.0"/>
</generic:Obs>
<generic:Obs>
<generic:ObsDimension value="2012"/>
<generic:ObsValue value="32600.0"/>
</generic:Obs>
<generic:Obs>
<generic:ObsDimension value="2011"/>
<generic:ObsValue value="31900.0"/>
</generic:Obs>
<generic:Obs>
<generic:ObsDimension value="2010"/>
<generic:ObsValue value="30500.0"/>
</generic:Obs>
</generic:Series>
</message:DataSet>
</message:GenericData>
Points to note: I've used cURL to send the request, but this could have been done in many other ways, e.g. using Wget, Perl, PHP, etc. As long as you're willing (and able) to use the system() command, getting data into R from Eurostat in programmatic fashion should be straightforward enough (see edit below). Munging the data into, say, a ts object (or mts object depending on the query you send) should be straightforward enough, too. Lastly, I'm using a Linux operating system (Ubuntu distribution), so if you're on Windows, the example above may not work for you.
I hope this helps!
Edit: I just noticed you loaded the RCurl package, so if you want you could replace whatever facility that package offers with the system command that I used.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With