Fastest way to copy the first X lines from one file to another within R? (cross-platform)

Tags: io, r, large-files

I cannot load the file into RAM (assume a user might want the first billion lines of a file with ten billion records).

Here is my solution, but I think there has got to be a faster way?

Thanks

# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000


# my attempt
incon <- file( infile , "r") 
outcon <- file( outfile , "w") 

for ( i in seq( num_lines ) ){

    line <- readLines( incon , 1 )

    writeLines( line , outcon )

}

close( incon )
close( outcon )
Asked Nov 17 '15 by Anthony Damico

2 Answers

You can use ff::read.table.ffdf for this. It stores the data on the hard disk rather than in RAM, so the whole file never has to fit in memory.

library(ff)
infile <- read.table.ffdf(file = "/some/big/file.txt")

Essentially you can use the above function in the same way as base::read.table, with the difference that the resulting object is stored on the hard disk.

You can also use the nrows argument to load only a specific number of rows; see the package documentation if you want to have a read. Once you have read the file, you can subset the specific rows you need and even convert them to data.frames if they fit in RAM.

There is also a write.table.ffdf function that allows you to write an ffdf object (such as the one returned by read.table.ffdf) back to a file, which makes the process even easier.
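For example, a minimal sketch combining the two (the paths and the row count are placeholders, and it assumes read.table's default settings work for your file):

library(ff)

# read only the first 1000 rows; the result lives on disk, not in RAM
first_rows <- read.table.ffdf(file = "/some/big/file.txt", nrows = 1000)

# write those rows back out to a smaller text file
write.table.ffdf(first_rows, file = "/some/smaller/file.txt")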


As an example of how to use read.table.ffdf (or read.delim.ffdf, which is pretty much the same thing), see the following:

#writing a file in my current directory
#note that the number of columns is not constant
sink(file='test.txt')
cat('foo , foo, foo\n')
cat('foo, foo\n')
cat('bar bar , bar\n')
sink()

#read it with read.delim.ffdf or read.table.ffdf
read.delim.ffdf(file='test.txt', sep='\n', header=F)

Output:

ffdf (all open) dim=c(3,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
V1           V1      integer       integer FALSE           FALSE            FALSE                 1                1               1           TRUE
ffdf data
              V1
1 foo , foo, foo
2 foo, foo      
3 bar bar , bar 

If you are working with a plain text file, this is a general solution, since each line ends with a \n character and sep='\n' therefore reads every line as a single field.

Answered by LyzandeR

C++ solution

It is not too difficult to write some C++ code for this:

#include <fstream>
#include <R.h>
#include <Rdefines.h>

extern "C" {

  // [[Rcpp::export]]
  SEXP dump_n_lines(SEXP rin, SEXP rout, SEXP rn) {
    // no checks on types and size
    std::ifstream strin(CHAR(STRING_ELT(rin, 0)));
    std::ofstream strout(CHAR(STRING_ELT(rout, 0)));
    int N = INTEGER(rn)[0];

    int n = 0;
    while (strin && n < N) {
      char c = strin.get();
      // if get() hit end of file, stop before writing a bogus character
      if (!strin) break;
      if (c == '\n') ++n;
      strout.put(c);
    }

    strin.close();
    strout.close();
    return R_NilValue;
  }
}

When saved as yourfile.cpp, you can do

Rcpp::sourceCpp('yourfile.cpp')

From RStudio you don't have to load anything else; from a plain R console you will first have to load Rcpp. On Windows you will probably have to install Rtools.
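The exported function can then be called like any other R function; note that the last argument is read with INTEGER(), so pass an integer (the paths below are just placeholders):

dump_n_lines("/some/big/file.txt", "/some/smaller/file.txt", 1000L)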

More efficient R code

By reading larger blocks instead of single lines, your R code will also speed up:

dump_n_lines2 <- function(infile, outfile, num_lines, block_size = 1E6) {
  incon <- file( infile , "r") 
  outcon <- file( outfile , "w") 

  remain <- num_lines

  while (remain > 0) {
    size <- min(remain, block_size)
    lines <- readLines(incon , n = size)
    writeLines(lines , outcon)
    # check for eof:
    if (length(lines) < size) break 
    remain <- remain - size
  }
  close( incon )
  close( outcon )
}

Benchmark
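Here solution0 is simply the loop from the question wrapped in a function; a minimal version of that wrapper, for reference:

solution0 <- function(infile, outfile, num_lines) {
  incon <- file(infile, "r")
  outcon <- file(outfile, "w")
  for (i in seq(num_lines)) {
    line <- readLines(incon, 1)
    writeLines(line, outcon)
  }
  close(incon)
  close(outcon)
}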

lines <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean commodo
imperdiet nunc, vel ultricies felis tincidunt sit amet. Aliquam id nulla eu mi
luctus vestibulum ac at leo. Integer ultrices, mi sit amet laoreet dignissim,
orci ligula laoreet diam, id elementum lorem enim in metus. Quisque orci neque,
vulputate ultrices ornare ac, interdum nec nunc. Suspendisse iaculis varius
dapibus. Donec eget placerat est, ac iaculis ipsum. Pellentesque rhoncus
maximus ipsum in hendrerit. Donec finibus posuere libero, vitae semper neque
faucibus at. Proin sagittis lacus ut augue sagittis pulvinar. Nulla fermentum
interdum orci, sed imperdiet nibh. Aliquam tincidunt turpis sit amet elementum
porttitor. Aliquam lectus dui, dapibus ut consectetur id, mollis quis magna.
Donec dapibus ac magna id bibendum."
lines <- rep(lines, 1E6)
writeLines(lines, con = "big.txt")

infile <- "big.txt"
outfile <- "small.txt"
num_lines <- 1E6L


library(microbenchmark)
microbenchmark(
  solution0(infile, outfile, num_lines),
  dump_n_lines2(infile, outfile, num_lines),
  dump_n_lines(infile, outfile, num_lines)
  )

Results in (solution0 is the OP's original solution):

Unit: seconds
                                     expr       min        lq      mean    median        uq       max neval cld
    solution0(infile, outfile, num_lines) 11.523184 12.394079 12.635808 12.600581 12.904857 13.792251   100   c
dump_n_lines2(infile, outfile, num_lines)  6.745558  7.666935  7.926873  7.849393  8.297805  9.178277   100  b 
 dump_n_lines(infile, outfile, num_lines)  1.852281  2.411066  2.776543  2.844098  2.965970  4.081520   100 a 

The C++ solution can probably be sped up by reading in large blocks of data at a time. However, this would make the code much more complex. Unless this is something I had to do on a very regular basis, I would probably stick with the pure R solution.

Remark: when your data is tabular, you can use my LaF package to read arbitrary lines and columns from your data set without having to read all of the data into memory.
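For example, a rough sketch of what that could look like with LaF (the column types and separator are assumptions about your data, not something taken from the question):

library(LaF)

# open the file without loading it into memory; column types here are only an example
laf <- laf_open_csv("/some/big/file.txt",
                    column_types = c("integer", "string"),
                    sep = ",")

# read just the first 1000 rows (optionally also a subset of columns)
first_rows <- laf[1:1000, ]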

Answered by Jan van der Laan