 

How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far unsuccessfully.

My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.

I calculate skipVec as:

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)])) - 1
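
For example, with a hypothetical badRowNumbers of c(2, 5, 6, 10) (not from my actual data), that formula would give:

badRowNumbers <- c(2, 5, 6, 10)
skipVec <- (badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)])) - 1
skipVec
# [1] 1 2 0 3   (skip 1 line before row 2, 2 before row 5, none before row 6, 3 before row 10)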

But for the moment I am just handing my function a skipVec vector of all zeros.

If my logic is correct, this should return all the rows. It does not. Instead I get an error:

"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
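
(For reference, a minimal sketch of how I currently understand the difference; the example is mine and may itself be mistaken. Reads on an open connection pick up where the previous read stopped, while reads that are handed a file name start from the top of the file each time.)

con <- file("testThis.DF")   # creates the connection object; the file is not opened yet
open(con)                    # successive reads now advance a position within the file
readLines(con, n = 1)        # header line
readLines(con, n = 1)        # first data row, not the header again
close(con)
read.table("testThis.DF", nrow = 1, header = TRUE)   # file name: always starts at the top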

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.

Here is the code that produces the error message above:

# Make a small test data frame, write it to a file, and read it back in 
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I yank the read.table command out of the lapply and the function and just run it by itself, I still get the same error.

asked Oct 06 '13 by andrewH



1 Answer

If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE, it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input

Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:

write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
  })
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
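
With the three-row test file above, skipVec <- c(0,1,0) keeps the first and third data rows and drops the second, so badRows.DF comes back with two rows (named 1 and 3) and the columns nnn and fff.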

Some clues towards higher speeds:

  1. Use scan instead of read.table. Read the data as character and only at the end, after you have put it into a character matrix or data frame, apply type.convert to each column.
  2. Instead of looping over skipVec, loop over its rle if that is much shorter, so you can read or skip whole chunks of lines at a time (see the sketch below).
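
A rough sketch of the second idea, assuming skipVec is a 0/1 keep/skip flag per data row as in the fixed code above (the rle bookkeeping here is only illustrative, not tested on your real file):

runs <- rle(skipVec)                          # consecutive stretches of keep (0) / skip (1)
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
keptLines <- character(0)
for (i in seq_along(runs$lengths)) {
  chunk <- readLines(con, n = runs$lengths[i])               # read the whole run in one call
  if (runs$values[i] == 0) keptLines <- c(keptLines, chunk)  # keep only the "good" runs
}
close(con)
badRows.DF <- setNames(read.table(text = keptLines, header = FALSE,
                                  row.names = 1), header)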
answered Sep 22 '22 by flodel