How to load big csv file with mixed-type columns using the bigmemory package

Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns so that the result is a dataframe with integer, character, and numeric columns?

Lourdes asked Aug 07 '11 04:08


2 Answers

Try the ff package for this.

library(ff)
help(read.table.ffdf)

Function ‘read.table.ffdf’ reads separated flat files into ‘ffdf’ objects, very much like (and using) ‘read.table’. It can also work with any convenience wrappers like ‘read.csv’ and provides its own convenience wrapper (e.g. ‘read.csv.ffdf’) for R's usual wrappers.

For a 200 MB file it should be as simple as this.

 x <- read.csv.ffdf(file=csvfile)

(For much bigger files you will likely need to investigate some of the configuration options, depending on your machine and OS.)
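
As a rough sketch of that call on a mixed-type file, assuming the ff package is installed: the temporary file and its column names (id, label, value) are made up for illustration. The text column is declared as a factor because ff stores strings as factors rather than as character vectors.

```r
library(ff)

# Hypothetical stand-in for the real 200 MB file: a tiny mixed-type CSV.
csvfile <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    id    = 1:5,
    label = c("a", "b", "c", "a", "b"),
    value = c(0.5, 1.5, 2.5, 3.5, 4.5),
    stringsAsFactors = FALSE
  ),
  csvfile,
  row.names = FALSE
)

# Read into an on-disk ffdf; colClasses is passed through to read.csv.
x <- read.csv.ffdf(
  file = csvfile,
  colClasses = c("integer", "factor", "numeric")
)

# Indexing rows of an ffdf pulls them into RAM as an ordinary data frame.
df <- x[1:3, ]
sapply(df, class)
```

Only the rows you index are materialised in memory, so the full file never has to fit in RAM at once.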

mdsumner answered Sep 30 '22 13:09


Ah, there are some things that are impossible in this life, and there are some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a dataframe.
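
A small base R illustration of the difference (the vectors here are made up for the example): binding an integer and a character column into a matrix coerces everything to character, while a data frame keeps each column's own type.

```r
ints  <- 1:3
chars <- c("a", "b", "c")

# A matrix holds a single atomic type, so the integers become strings.
m <- cbind(ints, chars)
typeof(m)

# A data frame is a list of columns, so each column keeps its type.
df <- data.frame(ints, chars, stringsAsFactors = FALSE)
sapply(df, typeof)
```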

Since a matrix must be of one type, attempting to snooker bigmemory to handle multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a dataframe. That will lead to more questions and more sorrow.

Now, what you can do is identify the type of each column and generate a set of distinct bigmemory files, each containing the columns of one particular type. E.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then you may be able to develop a wrapper that produces a data frame out of all of this. Still, I don't recommend that: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big dataframe griffin.
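
For what it's worth, the per-type split might look roughly like the sketch below, assuming the bigmemory package is installed. The data, column names, and the codeBM/label_levels names are hypothetical. One assumption to flag: bigmemory's "char" type is a one-byte integer, not a string, so this sketch stores the text column as integer codes and keeps the lookup table separately.

```r
library(bigmemory)

# Hypothetical small mixed-type data standing in for the big file.
df <- data.frame(
  id    = 1:4,
  label = c("x", "y", "x", "z"),
  value = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# One big.matrix per type.
intBM <- as.big.matrix(as.matrix(df["id"]), type = "integer")
numBM <- as.big.matrix(as.matrix(df["value"]), type = "double")

# Strings are encoded as integer codes; the levels live outside the matrix.
label_levels <- sort(unique(df$label))
codes  <- match(df$label, label_levels)
codeBM <- as.big.matrix(matrix(codes, ncol = 1), type = "integer")

# A wrapper would have to stitch the pieces back together like this.
rebuilt <- data.frame(
  id    = intBM[, 1],
  label = label_levels[codeBM[, 1]],
  value = numBM[, 1],
  stringsAsFactors = FALSE
)
```

Every consumer of these objects has to know about the side table of levels, which is a good part of why this arrangement is fragile.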

@mdsumner is correct in suggesting ff. Another storage option is HDF5, which you can reach from R through ncdf4 (netCDF-4 files are stored as HDF5). Unfortunately, these other packages are not as pleasant to use as bigmemory.

Iterator answered Sep 30 '22 14:09