I want to split large CSV files (larger than the available RAM) into smaller pieces, and then either use the pieces directly or save each one to disk for later use. Which R package is best suited to doing this with large files?

How can I cut large csv files using any R packages like ff or data.table?

3 Answers

I haven't tried it, but the skip and nrows parameters of read.table or read.csv are worth a try. These descriptions are from ?read.table:

skip integer: the number of lines of the data file to skip before beginning to read data.

nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

To avoid trouble at the end of the file you need some error handling: I don't know what happens when the skip value is greater than the number of rows in your big CSV.

P.S. I also don't know whether header=TRUE affects skip; you will have to check that as well.
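The skip/nrows idea can be sketched as follows. This is a minimal sketch, not a tested recipe for huge files: read_chunk, chunk_rows and demo_file are hypothetical names introduced here, and the header is read once separately so that skip only has to count data lines. Reading past the end of the file is treated as "no more data" whether read.csv raises an error or returns zero rows.

```r
# Sketch: read one chunk of a CSV with skip and nrows.
# read_chunk, chunk_rows and demo_file are hypothetical names.
read_chunk <- function(path, chunk, chunk_rows) {
  # Column names come from the first line only.
  col_names <- names(read.csv(path, nrows = 1, header = TRUE))
  # Skip the header line plus all rows that belong to earlier chunks.
  skip <- 1 + (chunk - 1) * chunk_rows
  part <- tryCatch(
    read.csv(path, skip = skip, nrows = chunk_rows,
             header = FALSE, col.names = col_names),
    error = function(e) NULL  # treat a read error as "nothing left"
  )
  # Past the end of the file we may get an error or zero rows;
  # both cases are reported as NULL.
  if (is.null(part) || nrow(part) == 0) return(NULL)
  part
}

# Small demo on a temporary file with 10 data rows
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = (1:10) * 2), demo_file, row.names = FALSE)

chunk <- 1
repeat {
  part <- read_chunk(demo_file, chunk, chunk_rows = 4)
  if (is.null(part)) break
  cat("chunk", chunk, "has", nrow(part), "rows\n")
  chunk <- chunk + 1
}
```

Note that every call reopens the file and skips from the top, so the cost per chunk grows with the chunk index.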

answered Sep 19 '22 01:09

berkorbay

The answer given by @berkorbay is OK, and I can confirm that header can be used together with skip. However, if your file is really large this gets painfully slow, because every read after the first must skip over all the lines that were read before.

I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script that splits the original file into chunks which you can then read one after the other. It is much faster. I enclose the source here, with some parts translated so that the intent is clear:

#!/usr/bin/perl
# Split a .csv file into fragments, repeating the header line in each one.
use strict ;
use warnings ;

# system("cls") ;   # Windows only: clears the screen
print("Fragment .csv file keeping header in each chunk\n") ;

print("\nEnter input file name  = ") ;
chomp(my $entrada = <STDIN>) ;
print("\nEnter maximum number of lines in each fragment = ") ;
chomp(my $nlineas = <STDIN>) ;
print("\nEnter output file name stem   = ") ;
chomp(my $salida = <STDIN>) ;
open(IN, "<", $entrada) || die "Cannot open input file: $!\n" ;

my $cabecera  = <IN> ;   # header line
my $leidas    = 0  ;     # data lines written to the current fragment
my $fragmento = 1  ;     # number of the current fragment
my $fichero   = $salida.$fragmento ;
open(OUT, ">", $fichero) || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
    # Start a new fragment once the current one holds $nlineas data lines
    if ($leidas >= $nlineas) {
        close(OUT) ;
        $fragmento++ ;
        $fichero   = $salida.$fragmento ;
        open(OUT, ">", $fichero) || die "Cannot open output file: $!\n" ;
        print OUT $cabecera ;
        $leidas = 0;
    }
    $leidas++ ;
    print OUT $_ ;
}
close(OUT) ;
close(IN) ;

Just save it under whatever name you like and execute it. The first line might have to be changed if Perl is installed in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
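For readers who would rather stay in R, the same splitting idea can be sketched with an open connection, which keeps its position between reads so nothing is skipped twice. This is only a sketch under assumptions: split_csv, stem, max_lines and demo_file are hypothetical names, the ".csv" suffix on the fragments is my own choice, and, like the line-based script above, it assumes that no quoted field contains a line break.

```r
# Sketch: split a CSV into fragments, repeating the header in each one.
# split_csv, stem, max_lines and demo_file are hypothetical names.
split_csv <- function(path, stem, max_lines) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  fragment <- 0
  repeat {
    # The open connection remembers its position, so each call
    # continues where the previous one stopped.
    lines <- readLines(con, n = max_lines)
    if (length(lines) == 0) break
    fragment <- fragment + 1
    writeLines(c(header, lines), paste0(stem, fragment, ".csv"))
  }
  fragment  # number of fragments written
}

# Small demo: 10 data rows split into fragments of at most 4 rows
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = (1:10) * 2), demo_file, row.names = FALSE)
stem <- file.path(tempdir(), "fragment_")
n_fragments <- split_csv(demo_file, stem, max_lines = 4)

# Each fragment is a complete CSV that can be read on its own
last_part <- read.csv(paste0(stem, n_fragments, ".csv"))
cat(n_fragments, "fragments; the last one has", nrow(last_part), "rows\n")
```

Only max_lines lines are held in memory at a time, so the fragment size can be tuned to the available RAM.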

answered Sep 17 '22 01:09

F. Tusell

You can read the big file with read.csv.ffdf from the ff package, using parameters like these:

library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)

Once the big file has been read into an ffdf object, a range of rows can be extracted as an ordinary data frame with: a[1000:1000000,]

The rest of the code subsets the ffdf and saves the resulting data frames:

totalrows = dim(a)[1]
#average size of one row, estimated from the first 10000 rows
row.size = as.integer(object.size(a[1:10000,])) / 10000 #in bytes

block.size = 200000000  #in bytes, i.e. 200 MB

#rows.block is rows per block
rows.block = ceiling(block.size/row.size)

#nmaps is the number of chunks/maps of the big dataframe (ffdf)
nmaps = ceiling(totalrows/rows.block)


for(i in seq_len(nmaps)){
  #first and last row of chunk i; the last chunk may be shorter
  first.row = (i-1)*rows.block + 1
  last.row = min(i*rows.block, totalrows)
  df = a[first.row:last.row,]
  #process df or save it
  write.csv(df,paste0("M",i,".csv"))
  #remove df
  rm(df)
}
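The row-range arithmetic can be checked without ff on plain numbers. The sketch below is one way to compute the ranges so that every row lands in exactly one chunk, including the case where the total is an exact multiple of the block size; chunk_ranges and its argument names are hypothetical and not part of the ff package.

```r
# Sketch: first and last row of each chunk for a given block size.
# chunk_ranges, total_rows and rows_per_block are hypothetical names.
chunk_ranges <- function(total_rows, rows_per_block) {
  n_chunks <- ceiling(total_rows / rows_per_block)
  idx <- seq_len(n_chunks)
  data.frame(
    chunk = idx,
    first = (idx - 1) * rows_per_block + 1,
    # the last chunk is cut off at total_rows
    last  = pmin(idx * rows_per_block, total_rows)
  )
}

uneven <- chunk_ranges(10, 4)  # total is not a multiple of the block size
even   <- chunk_ranges(8, 4)   # exact multiple: no empty trailing chunk
print(uneven)
print(even)
```

Each row of the result can be used as a[first:last,] on an ffdf or on an ordinary data frame.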

answered Sep 21 '22 01:09

Alok Nayak
