I have some R scripts, where I have to load several dataframe in R as quickly as possible. This is quite important as reading the data is the slowest part of the procedure. E.g.: plotting from different dataframes. I get the data in sav (SPSS) format, but I could transform it to any format as suggested. Merging the dataframes is not an option unfortunately. What could be the fastest way to load the data? I was thinking of the following: <ul> <li>transform from sav to binary R object (Rdata) in the first time, and later always load this, as it seems a lot quicker than <code>read.spss</code>.</li> <li>transform from sav to csv files and reading data from those with given parameters discussed in this topic,</li> <li>or is it worth setting up a MySQL backend on localhost and load data from that? Could it be faster? If so, can I also save any custom <code>attr</code> values of the variables (e.g. variable.labels from Spss imported files)? Or this should be done in a separate table?</li> </ul> Any other thoughts are welcome. Thank you for every suggestion in advance! <hr> I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution loading only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, that is why I made up (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development, any recommendation is welcome! I will soon post a vignette with accurate benchmark results with the help of microbenchmark package.

Thank you all for the tips and answers, I did some summary and experiment based on that. See a little test with a public database (ESS 2008 in Hungary) below. The database have 1508 cases and 508 variables, so it could be a mid-sized data. That might be a good example to do the test on (for me), but of course special needs would require an experiment with adequate data. Reading the data from SPSS sav file without any modification: <pre class="prettyprint"><code>> system.time(data <- read.spss('ESS_HUN_4.sav')) user system elapsed 2.214 0.030 2.376 </code></pre> Loading with a converted binary object: <pre class="prettyprint"><code>> save('data',file='ESS_HUN_4.Rdata') > system.time(data.Rdata <- load('ESS_HUN_4.Rdata')) user system elapsed 0.28 0.00 0.28 </code></pre> Trying with csv: <pre class="prettyprint"><code>> write.table(data, file="ESS_HUN_4.csv") > system.time(data.csv <- read.csv('ESS_HUN_4.csv')) user system elapsed 1.730 0.010 1.824 </code></pre> Trying with "fine-tuned" csv loading: <pre class="prettyprint"><code>> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=",")) user system elapsed 1.296 0.014 1.362 </code></pre> Also with package sqldf, which seems to load csv files a lot faster: <pre class="prettyprint"><code>> library(sqldf) > f <- file("ESS_HUN_4.csv") > system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t"))) user system elapsed 0.939 0.106 1.071 </code></pre> And also loading the data from a MySQL database running on localhost: <pre class="prettyprint"><code>> library(RMySQL) > con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='') > dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE) > system.time(data <- dbReadTable(con, 'data')) user system elapsed 0.583 0.026 1.055 > query <-('SELECT * FROM data') > system.time(data.sql <- dbGetQuery(con, query)) user system elapsed 0.270 0.020 0.473 </code></pre> Here, I think we should add the two <code>system.time</code> reported, as connecting to the data also counts in our case. Please comment, if I misunderstood something. But let us see if querying only some variables, as eg. while plotting we do not need all the dataframe in most cases, and querying only two variables is enough to create a nice plot of them: <pre class="prettyprint"><code>> query <-('SELECT c1, c19 FROM data') > system.time(data.sql <- dbGetQuery(con, query)) user system elapsed 0.030 0.000 0.112 </code></pre> Which seems really great! Of course just after loading the table with <code>dbReadTable</code> Summary: nothing to beat reading the whole data from binary file, but reading only a few columns (or other filtered data) from the same database table might be also weighted in some special cases. Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD. <hr> UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object - which looks a lot faster then any method examined above. Be aware: the code will look really bad, but still very effective :) First, I save all columns of a data.frame into different binary objects via the following loop: <pre class="prettyprint"><code>attach(data) for (i in 1:length(data)) { save(list=names(data)[i],file=paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep='')) } detach(data) </code></pre> And then I load two columns of the data: <pre class="prettyprint"><code>> system.time(load('ESS_HUN_4-c19.Rdata')) + > system.time(load('ESS_HUN_4-c1.Rdata')) + > system.time(data.c1_c19 <- cbind(c1, c19)) user system elapsed 0.003 0.000 0.002 </code></pre> Which looks like a "superfast" method! :) Note: it was loaded 100 times faster than the fastest (loading the whole binary object) method above. I have made up a very tiny package (named: saves), look in github for more details if interested. <hr> UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - if only the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources for details or the one on my homepage, and let me introduce also a nice boxplot of some benchmark done: <img src="https://i.stack.imgur.com/7vbUW.png" alt="Comparison of different data frame/list loading mechanism by speed"> This boxplot shows the benefit of using saves package to load only a subset of variables against <code>load</code> and <code>read.table</code> or <code>read.csv</code> from base, <code>read.spss</code> from foreign or <code>sqldf</code> or <code>RMySQL</code> packages.

How to load data quickly into R?

Tags:

performance

r

benchmarking

load

I have some R scripts, where I have to load several dataframe in R as quickly as possible. This is quite important as reading the data is the slowest part of the procedure. E.g.: plotting from different dataframes. I get the data in sav (SPSS) format, but I could transform it to any format as suggested. Merging the dataframes is not an option unfortunately.

What could be the fastest way to load the data? I was thinking of the following:

transform from sav to binary R object (Rdata) in the first time, and later always load this, as it seems a lot quicker than read.spss.
transform from sav to csv files and reading data from those with given parameters discussed in this topic,
or is it worth setting up a MySQL backend on localhost and load data from that? Could it be faster? If so, can I also save any custom attr values of the variables (e.g. variable.labels from Spss imported files)? Or this should be done in a separate table?

Any other thoughts are welcome. Thank you for every suggestion in advance!

I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution loading only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, that is why I made up (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development, any recommendation is welcome!

I will soon post a vignette with accurate benchmark results with the help of microbenchmark package.

840

asked Jan 21 '11 08:01

daroczig

Video Answer

2 Answers

Thank you all for the tips and answers, I did some summary and experiment based on that.

See a little test with a public database (ESS 2008 in Hungary) below. The database have 1508 cases and 508 variables, so it could be a mid-sized data. That might be a good example to do the test on (for me), but of course special needs would require an experiment with adequate data.

Reading the data from SPSS sav file without any modification:

> system.time(data <- read.spss('ESS_HUN_4.sav'))    user  system elapsed    2.214   0.030   2.376

Loading with a converted binary object:

> save('data',file='ESS_HUN_4.Rdata') > system.time(data.Rdata <- load('ESS_HUN_4.Rdata'))    user  system elapsed     0.28    0.00    0.28

Trying with csv:

> write.table(data, file="ESS_HUN_4.csv") > system.time(data.csv <- read.csv('ESS_HUN_4.csv'))    user  system elapsed    1.730   0.010   1.824

Trying with "fine-tuned" csv loading:

> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=","))    user  system elapsed    1.296   0.014   1.362

Also with package sqldf, which seems to load csv files a lot faster:

> library(sqldf) > f <- file("ESS_HUN_4.csv") >  system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t")))    user  system elapsed    0.939   0.106   1.071

And also loading the data from a MySQL database running on localhost:

> library(RMySQL)  > con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='') > dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE) > system.time(data <- dbReadTable(con, 'data'))    user  system elapsed    0.583   0.026   1.055  > query <-('SELECT * FROM data') > system.time(data.sql <- dbGetQuery(con, query))    user  system elapsed    0.270   0.020   0.473

Here, I think we should add the two system.time reported, as connecting to the data also counts in our case. Please comment, if I misunderstood something.

But let us see if querying only some variables, as eg. while plotting we do not need all the dataframe in most cases, and querying only two variables is enough to create a nice plot of them:

> query <-('SELECT c1, c19 FROM data') > system.time(data.sql <- dbGetQuery(con, query))    user  system elapsed    0.030   0.000   0.112

Which seems really great! Of course just after loading the table with dbReadTable

Summary: nothing to beat reading the whole data from binary file, but reading only a few columns (or other filtered data) from the same database table might be also weighted in some special cases.

Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD.

UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object - which looks a lot faster then any method examined above.

Be aware: the code will look really bad, but still very effective :)

First, I save all columns of a data.frame into different binary objects via the following loop:

attach(data) for (i in 1:length(data)) {     save(list=names(data)[i],file=paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep='')) } detach(data)

And then I load two columns of the data:

> system.time(load('ESS_HUN_4-c19.Rdata')) +  >     system.time(load('ESS_HUN_4-c1.Rdata')) +  >     system.time(data.c1_c19 <- cbind(c1, c19))     user  system elapsed      0.003   0.000   0.002

Which looks like a "superfast" method! :) Note: it was loaded 100 times faster than the fastest (loading the whole binary object) method above.

I have made up a very tiny package (named: saves), look in github for more details if interested.

UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - if only the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources for details or the one on my homepage, and let me introduce also a nice boxplot of some benchmark done:

Comparison of different data frame/list loading mechanism by speed

This boxplot shows the benefit of using saves package to load only a subset of variables against load and read.table or read.csv from base, read.spss from foreign or sqldf or RMySQL packages.

answered Oct 02 '22 13:10

daroczig

It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore.

Any kind of text file is a different story, as you include invariably an overhead : each time you read in the text file, the data has to be transformed to the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.

Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load in the whole dataset before you can start selecting the rows/columns, can gain you quite some time. But this only works with huge datasets, as reading a binary file is quite a bit faster than searching a database.

If the data is not too big, you can save different dataframes in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of dataframes in a list or in a seperate environment (see also ?environment for some simple examples). This allows for lapply / eapply solutions to process multiple dataframes at once.

answered Oct 02 '22 12:10

Joris Meys

Related questions
                            
                                Annotate ggplot2 facets with number of observations per facet [duplicate]
                            
                                Text clustering with Levenshtein distances
                            
                                Fastest way to add rows for missing time steps?
                            
                                How to use the strsplit function with a period
                            
                                Is it possible to get code completion for R in Emacs ESS similar to what is available in Rstudio?
                            
                                Show correlations as an ordered list, not as a large matrix
                            
                                Print list without line numbers in R
                            
                                Why can't you use repetition quantifiers in zero-width look behind assertions?
                            
                                Load R package from character string
                            
                                Package error when running r code on command line
                            
                                Count number of distinct values in a vector
                            
                                R: Plotting a 3D surface from x, y, z
                            
                                developing shiny app as a package and deploying it to shiny server
                            
                                cut function in R- labeling without scientific notations for use in ggplot2
                            
                                Which library could be used to make a Chord diagram in R? [closed]
                            
                                Plotting contours on an irregular grid
                            
                                Producing a vector graphics image (i.e. metafile) in R suitable for printing in Word 2007
                            
                                R: get list of files but not of directories
                            
                                defining minimum point size in ggplot2 - geom_point
                            
                                "Erroneous nesting of equation structures" in using "\begin{align}" in a multi-line equation in rmarkdown to knit+pandoc pdf

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to load data quickly into R?

Tags:

performance

r

benchmarking

load

daroczig

People also ask

Video Answer

2 Answers

daroczig

Joris Meys

Recent Activity

Donate For Us