I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows
. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table
on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec
as:
(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers-1]))-1
But for the moment I am just handing my function a skipVec
vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply
.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF
# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF <- lapply(skipVec, FUN=function(pass){
read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)
The error occurs before the close command. If I yank the readLines command out of the lapply and the function and just stick it in by itself, I still get the same error.
Loading a large dataset: use fread() or functions from readr instead of read. xxx() . If you really need to read an entire csv in memory, by default, R users use the read. table method or variations thereof (such as read.
Read Lines from a File in R Programming – readLines() Function. readLines() function in R Language reads text lines from an input file. The readLines() function is perfect for text files since it reads the text line by line and creates character objects for each of the lines.
Method 3: Using fread() method If the CSV files are extremely large, the best way to import into R is using the fread() method from the data. table package. The output of the data will be in the form of Data table in this case.
If instead of running read.table
through lapply
you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
nnn fff
1 2 aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
X2 X3 bb
1 3 5 cc
Because header = TRUE
it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
line <- read.table(con, nrow = 1, header = FALSE, sep = "",
row.names = 1)
if (pass) NULL else line
})
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
Some clues towards higher speeds:
scan
instead of read.table
. Read data as character
and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert
to each column.skipVec
, loop over its rle
if it is much shorter. So you'll be able to read or skip chunks of lines at a time. If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With