 

How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far unsuccessfully.

My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.

I calculate skipVec as:

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)])) - 1
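
For example, with a hypothetical badRowNumbers of c(2, 5, 6, 10) (not from my actual data), that formula would give:

badRowNumbers <- c(2, 5, 6, 10)
skipVec <- (badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)])) - 1
skipVec
# [1] 1 2 0 3   (skip 1 line before row 2, 2 before row 5, none before row 6, 3 before row 10)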

But for the moment I am just handing my function a skipVec vector of all zeros.

If my logic is correct, this should return all the rows. It does not. Instead I get an error:

"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
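
(For reference, a minimal sketch of how I currently understand the difference; the example is mine and may itself be mistaken. Reads on an open connection pick up where the previous read stopped, while reads that are handed a file name start from the top of the file each time.)

con <- file("testThis.DF")   # creates the connection object; the file is not opened yet
open(con)                    # successive reads now advance a position within the file
readLines(con, n = 1)        # header line
readLines(con, n = 1)        # first data row, not the header again
close(con)
read.table("testThis.DF", nrow = 1, header = TRUE)   # file name: always starts at the top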

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.

Here is the code that produces the error message above:

# Make a small test data frame, write it to a file, and read it back in 
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I yank the read.table command out of the lapply and the function and just run it by itself, I still get the same error.

asked Oct 06 '13 by andrewH



1 Answer

If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE, it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input

Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:

write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
  })
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
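
With the three-row test file above, skipVec <- c(0,1,0) keeps the first and third data rows and drops the second, so badRows.DF comes back with two rows (named 1 and 3) and the columns nnn and fff.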

Some clues towards higher speeds:

  1. Use scan instead of read.table. Read the data as character and only at the end, after you have put it into a character matrix or data frame, apply type.convert to each column.
  2. Instead of looping over skipVec, loop over its rle if that is much shorter, so you can read or skip whole chunks of lines at a time (see the sketch below).
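
A rough sketch of the second idea, assuming skipVec is a 0/1 keep/skip flag per data row as in the fixed code above (the rle bookkeeping here is only illustrative, not tested on your real file):

runs <- rle(skipVec)                          # consecutive stretches of keep (0) / skip (1)
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
keptLines <- character(0)
for (i in seq_along(runs$lengths)) {
  chunk <- readLines(con, n = runs$lengths[i])               # read the whole run in one call
  if (runs$values[i] == 0) keptLines <- c(keptLines, chunk)  # keep only the "good" runs
}
close(con)
badRows.DF <- setNames(read.table(text = keptLines, header = FALSE,
                                  row.names = 1), header)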
answered Sep 22 '22 by flodel