I have a large file with lots of data that I'd like to read into a data frame, but some rows are invalid and cause read.table to break. I tried the following method to skip the invalid lines, but the performance is very poor:
counts<-count.fields(textConnection(lines),sep="\001")
raw_data<-read.table(textConnection(lines[counts == 34]), sep="\001")
Is there any better way to achieve this? Thanks
Using @PaulHiemstra's sample data:
read.table("test.csv", sep = ";", fill = TRUE)
Then you can take care of the NAs as you wish.
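For example, one way to drop the padded rows afterwards (a minimal sketch, assuming every valid row should have exactly 4 fields) is to keep only the complete cases:
raw_data <- read.table("test.csv", sep = ";", fill = TRUE)
# Rows that were too short are padded with NA, so keep only complete rows
raw_data <- raw_data[complete.cases(raw_data), ]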
What you could do is iterate over the lines in the file, and only add the lines that have the correct length.
I defined the following test csv file:
1;2;3;4
1;2;3;4
1;2;3
1;2;3;4
Using read.table fails:
> read.table("test.csv", sep = ";")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 3 did not have 4 elements
Now an iterative approach:
require(plyr)
no_lines = 4
correct_length = 4
file_con = file("test.csv", "r")
result = ldply(1:no_lines, function(line) {
  dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
  if(length(dum) == correct_length) {
    return(dum)
  } else {
    cat(sprintf("Skipped line %s\n", line))
    return(NULL)
  }
})
close(file_con)
> result
V1 V2 V3 V4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Of course this is a trivial example, as the file is really small. Let us create a more challenging example to act as a benchmark.
# First file with invalid rows
norow = 10e5 # number of rows
no_lines = round(runif(norow, min = 3, max = 4))
no_lines[1] = correct_length
file_content = ldply(no_lines, function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_test.csv")
# Second file of the same size with only valid rows
file_content = ldply(rep(4, norow), function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_normal.csv")
Now for the benchmark
# Iterative approach
system.time({
  file_con <- file("big_test.csv", "r")
  result_test <- ldply(1:norow, function(line) {
    dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
    if(length(dum) == correct_length) {
      return(dum)
    } else {
      # Commenting this out speeds things up by 30%
      #cat(sprintf("Skipped line %s\n", line))
      return(NULL)
    }
  })
  close(file_con)
})
user system elapsed
20.559 0.047 20.775
# Normal read.table
system.time(result_normal <- read.table("big_normal.csv", sep = ";"))
user system elapsed
1.060 0.015 1.079
# read.table with fill = TRUE
system.time({
  result_fill <- read.table("big_test.csv", sep = ";", fill = TRUE)
  complete_rows <- complete.cases(result_fill)
  result_fill <- result_fill[complete_rows, ]
})
user system elapsed
1.161 0.033 1.203
# Specifying which type the columns are (e.g. character or numeric)
# using the colClasses argument.
system.time({
  result_fill <- read.table("big_test.csv", sep = ";", fill = TRUE,
                            colClasses = rep("numeric", 4))
  complete_rows <- complete.cases(result_fill)
  result_fill <- result_fill[complete_rows, ]
})
user system elapsed
0.933 0.064 1.001
So the iterative approach is quite a bit slower, but 20 seconds for 1 million rows might be acceptable (although this depends on your definition of acceptable), especially when you only have to do it once and then save the result using save for later retrieval. The solution suggested by @Paolo is almost as fast as the normal call to read.table. Rows that contain the wrong number of columns (and thus NAs) are eliminated using complete.cases. Specifying which classes the columns are improves the performance further, and I think this effect will get bigger as the number of columns and rows grows.
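For instance, parsing the file once and caching the cleaned result could look like this (a minimal sketch; the file name is just an example):
# Save the cleaned data frame once ...
save(result_fill, file = "big_test_clean.RData")
# ... and load it instantly in later sessions
load("big_test_clean.RData")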
So in conclusion, the best option is to use read.table with fill = TRUE, while specifying the classes of the columns. The iterative approach using ldply is only a good option if you want more flexibility in how the lines are read, e.g. only reading a line if a certain value is above a threshold (see the sketch below). But this could probably be done more quickly by reading all the data into R and then creating a subset. Only when the data is bigger than your RAM could I imagine the iterative approach having its merits.
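As an illustration, a threshold-based variant of the iterative approach could look like the sketch below (the numeric test on the first field and the threshold value are hypothetical, just to show the structure):
require(plyr)
threshold = 2
file_con = file("big_test.csv", "r")
result_threshold = ldply(1:norow, function(line) {
  dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
  # Keep a line only if it has the right length and its first value exceeds the threshold
  if(length(dum) == correct_length && as.numeric(dum[1]) > threshold) {
    return(dum)
  } else {
    return(NULL)
  }
})
close(file_con)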