
How to use zoo or xts with large data?

How can I use the R packages zoo or xts with very large data sets (100 GB)? I know there are packages such as bigrf, ff, and bigmemory that can deal with data of this size, but they only offer a limited set of commands: they don't have the functions of zoo or xts, and I don't know how to make zoo or xts use them. How can this be done?

I've also seen other, database-related options, such as sqldf and hadoopstreaming, RHadoop, and some tools used by Revolution R. What do you advise? Is there anything else?

I just want to aggregate series, cleanse them, and run some cointegration tests and plots. I would rather not have to code and implement new functions for every command I need, processing small pieces of data each time.

Added: I'm on Windows

asked Mar 27 '13 by skan

People also ask

What is an xts dataset?

eXtensible Time Series (xts) is a powerful package that provides an extensible time series class, enabling uniform handling of many R time series classes by extending zoo. Load the package with library(xts).

What does zoo do in R?

zoo is an R package providing an S3 class with methods for indexed totally ordered observations, such as discrete irregular time series. Its key design goals are independence of a particular index/time/date class and consistency with base R and the "ts" class for regular time series.
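
For instance, a minimal irregular daily series in zoo (the values and dates below are illustrative only):

    library(zoo)

    # An irregular series: zoo pairs each value with its Date index
    idx <- as.Date(c("2013-01-01", "2013-01-02", "2013-01-05"))
    z   <- zoo(c(1.2, 1.5, 1.1), order.by = idx)

    coredata(z)   # the observation values
    index(z)      # the Date index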

How do I convert data to time series in R?

To convert a data frame with a date column to a time series object, first import and load the xts package, then call the xts() function with the required parameters (the data and a time-based index) to create the time series object in R; at the end, is.xts() can be used to verify the result.
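
A minimal sketch of that conversion (the data frame and its column names are hypothetical):

    library(xts)

    # A toy data frame with a character date column
    df <- data.frame(date  = c("2013-01-01", "2013-01-02", "2013-01-03"),
                     price = c(100.5, 101.2, 99.8))

    # xts() needs a time-based index, so convert the column with as.Date()
    x <- xts(df$price, order.by = as.Date(df$date))

    is.xts(x)   # TRUE confirms the conversion worked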


1 Answer

I have had a similar problem (albeit I was only playing with 9-10 GB). My experience is that R cannot handle that much data on its own, especially since your dataset appears to contain time series data.

If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ ).
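
As a rough sketch of the sparse approach (dimensions and values are illustrative), sparseMatrix() from the Matrix package stores only the non-zero cells:

    library(Matrix)

    # Only the non-zero entries are stored, not the full 1000 x 1000 grid
    m <- sparseMatrix(i    = c(1, 3, 5),          # row positions of non-zeros
                      j    = c(2, 4, 1),          # column positions
                      x    = c(1.5, -2.0, 0.7),   # the non-zero values
                      dims = c(1000, 1000))

    object.size(m)             # a few KB
    object.size(as.matrix(m))  # ~8 MB for the dense equivalent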

I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database using SQL syntax; the result is downloaded into R as a data frame. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
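
A minimal sketch of that workflow; the connection details and the trades table below are hypothetical, but the DBI-style calls (dbConnect, dbGetQuery, dbDisconnect) are the standard RPostgreSQL interface:

    library(RPostgreSQL)

    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname = "tsdb", host = "localhost",
                     user = "ruser", password = "secret")

    # Let the database do the heavy aggregation; R only sees the small result
    daily <- dbGetQuery(con, "
        SELECT date_trunc('day', ts) AS day, avg(price) AS avg_price
        FROM   trades
        GROUP  BY 1
        ORDER  BY 1")

    dbDisconnect(con)

    # daily is an ordinary data frame, now small enough for xts
    library(xts)
    x <- xts(daily$avg_price, order.by = as.Date(daily$day))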

Drawback: you would need to upload the data into the database first, and your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Uploading "well-behaved" data into the DB, however, is easy (see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table?).
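
From R itself, dbWriteTable() can push a clean data frame into a table chunk by chunk, which avoids hand-writing COPY statements; the file and table names below are hypothetical:

    library(RPostgreSQL)

    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname = "tsdb", host = "localhost",
                     user = "ruser", password = "secret")

    # Append one well-behaved CSV chunk at a time so R never holds 100 GB
    chunk <- read.csv("trades_part01.csv")
    dbWriteTable(con, "trades", chunk, append = TRUE, row.names = FALSE)

    dbDisconnect(con)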

I would recommend using PostgreSQL or another relational database for your task. I have not tried Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.

answered Oct 11 '22 by Skif