How can I cut large csv files using any R packages like ff or data.table?

Tags: r, large-data

I want to split large CSV files (larger than the available RAM) into pieces and either use the pieces directly or save each one to disk for later use. Which R package is best suited to doing this with large files?

Alok Nayak asked May 29 '14 09:05




3 Answers

I haven't tried it, but the skip and nrows parameters of read.table or read.csv are worth a try. These descriptions are from ?read.table:

skip integer: the number of lines of the data file to skip before beginning to read data.

nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

To avoid trouble at the end of the file you need some error handling: I don't know what happens when the skip value is greater than the number of rows in your big CSV.

P.S. I also don't know whether header=TRUE affects skip, so you will have to check that too.
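As a rough sketch of the skip/nrows idea, the helper below reads one chunk at a time. The function name read_chunk and its arguments are made up for illustration. It assumes the file has a single header line, that skip is applied before any header is read (so the header is fetched separately), and that reading past the end either signals an error or yields zero rows; both cases are treated as "no more data".

```r
# Hypothetical helper: read chunk number `k` (1-based) holding up to
# `chunk_rows` data rows from a CSV file with one header line.
read_chunk <- function(path, k, chunk_rows) {
  # Read the header (plus one row) only to recover the column names.
  col_names <- names(read.csv(path, nrows = 1, header = TRUE))
  # Skip the header line and every row that belongs to an earlier chunk.
  skip <- 1 + (k - 1) * chunk_rows
  chunk <- tryCatch(
    read.csv(path, skip = skip, nrows = chunk_rows,
             header = FALSE, col.names = col_names),
    error = function(e) NULL  # e.g. no lines left in the input
  )
  # Treat an empty result the same way as an error: nothing left to read.
  if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
}
```

A caller would loop over k = 1, 2, ... and stop at the first NULL. Column types are guessed separately for each chunk, so passing colClasses may be needed for consistent results.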

berkorbay answered Sep 19 '22 01:09


The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large this gets painfully slow, because each read after the first must skip over all previously read lines.
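One way to avoid the repeated skipping inside R is to keep a single file connection open, so that each read continues where the previous one stopped. This is a different technique from the skip approach, sketched here under assumptions: the name process_in_chunks and the callback argument fun are hypothetical, the file has one header line, and fields do not contain embedded newlines.

```r
# Hypothetical helper: call `fun` on successive chunks of a CSV file,
# reading through one open connection so nothing is read twice.
process_in_chunks <- function(path, chunk_rows, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Consume the header line once and split it into column names.
  col_names <- scan(text = readLines(con, n = 1), what = "",
                    sep = ",", quiet = TRUE)
  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = FALSE,
               col.names = col_names),
      error = function(e) NULL  # e.g. no lines left in the input
    )
    # Stop when the connection is exhausted.
    if (is.null(chunk) || nrow(chunk) == 0) break
    fun(chunk)
  }
  invisible(NULL)
}
```

For example, process_in_chunks("big.csv", 100000, function(d) print(nrow(d))) would print the size of each chunk in turn.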

I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script that splits the original file into chunks which you can then read one after the other. It is much faster. I enclose the source here, with some parts translated so that the intent is clear:

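#!/usr/bin/perl
use strict;
use warnings;

print("Fragment .csv file keeping header in each chunk\n");

print("\nEnter input file name  = ");
chomp(my $entrada = <STDIN>);    # input file name
print("\nEnter maximum number of lines in each fragment = ");
chomp(my $nlineas = <STDIN>);    # data lines per fragment
die "Maximum number of lines must be a positive integer\n"
    unless $nlineas =~ /^[1-9]\d*$/;
print("\nEnter output file name stem   = ");
chomp(my $salida = <STDIN>);     # stem of the output file names

open(my $in, '<', $entrada) or die "Cannot open input file: $!\n";

# Header line, copied to the top of every fragment
defined(my $cabecera = <$in>) or die "Input file is empty\n";
my $leidas    = 0;               # data lines written to the current fragment
my $fragmento = 1;               # number of the current fragment
my $fichero   = $salida . $fragmento;
open(my $out, '>', $fichero) or die "Cannot open output file: $!\n";
print $out $cabecera;
while (<$in>) {
    # Start a new fragment once the current one holds $nlineas data lines
    if ($leidas >= $nlineas) {
        close($out);
        $fragmento++;
        $fichero = $salida . $fragmento;
        open($out, '>', $fichero) or die "Cannot open output file: $!\n";
        print $out $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print $out $_;
}
close($out);
close($in);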

Just save the script under any name and run it. The first line may have to be changed if Perl is installed in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
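For readers who would rather stay in R, the same split can be sketched with readLines and writeLines. This is not a translation the answer provides: the function name split_csv and its arguments are hypothetical, and the sketch assumes one header line and no line breaks inside quoted fields.

```r
# Hypothetical helper: split a CSV file into fragments of at most
# `max_lines` data lines, repeating the header in each fragment.
split_csv <- function(path, max_lines, stem) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  part <- 0
  out_files <- character(0)
  repeat {
    # Each call continues from where the previous one stopped.
    lines <- readLines(con, n = max_lines)
    if (length(lines) == 0) break
    part <- part + 1
    out <- paste0(stem, part)
    writeLines(c(header, lines), out)
    out_files <- c(out_files, out)
  }
  out_files  # names of the fragments written
}
```

Calling split_csv("big.csv", 1000000, "fragment") would write fragment1, fragment2, and so on, each of which can then be read with read.csv as usual.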

F. Tusell answered Sep 17 '22 01:09


You can use read.csv.ffdf from the ff package, with parameters like these, to read a big file:

library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)

Once the big file has been read into an ff object, it can be subset into ordinary data frames with, for example: a[1000:1000000,]

The rest of the code subsets the ff object into blocks and saves each block as a data frame:

totalrows = dim(a)[1]
# approximate size of one row in bytes, estimated from the first rows
sample.rows = min(10000, totalrows)
row.size = as.numeric(object.size(a[1:sample.rows,])) / sample.rows

block.size = 200000000  # target block size in bytes (200 MB)

# rows.block is rows per block
rows.block = ceiling(block.size/row.size)

# nblocks is the number of chunks the big data frame (ff) is split into;
# ceiling avoids an empty last block when totalrows is an exact multiple
nblocks = ceiling(totalrows/rows.block)

for(i in seq_len(nblocks)){
  first = (i-1)*rows.block + 1
  last  = min(i*rows.block, totalrows)
  df = a[first:last,]
  # process df or save it
  write.csv(df, paste0("M",i,".csv"))
  # remove df
  rm(df)
}
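The block arithmetic can be checked on its own, without ff. The sketch below computes the first and last row of every block; the name block_ranges is hypothetical. It rounds the block count up, so a row count that is an exact multiple of the block size does not produce an empty trailing block.

```r
# Hypothetical helper: first and last row index of each block when
# `total_rows` rows are split into blocks of `rows_per_block` rows.
block_ranges <- function(total_rows, rows_per_block) {
  n_blocks <- ceiling(total_rows / rows_per_block)
  first <- (seq_len(n_blocks) - 1) * rows_per_block + 1
  # The last block may be shorter than the others.
  last <- pmin(first + rows_per_block - 1, total_rows)
  data.frame(first = first, last = last)
}

block_ranges(10, 4)  # blocks 1-4, 5-8, 9-10
block_ranges(8, 4)   # blocks 1-4, 5-8
```

Each row of the result can be used to index the ff object, for example a[r$first[i]:r$last[i],] for a result stored in r.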
Alok Nayak answered Sep 21 '22 01:09