How to skip invalid rows when reading data frame from file in R?

Tags: dataframe, r

I have a large file which contains lots of data, and I'd like to read it into a data frame, but it contains some invalid rows that cause read.table to fail. I tried the following to skip the invalid lines, but the performance is very poor.

counts <- count.fields(textConnection(lines), sep = "\001")
raw_data <- read.table(textConnection(lines[counts == 34]), sep = "\001")

Is there any better way to achieve this? Thanks

asked May 15 '12 by zjffdu

2 Answers

Using @PaulHiemstra's sample data:

read.table("test.csv", sep = ";", fill=TRUE)

Then you can take care of the NAs as you wish.
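
For example, a minimal sketch of cleaning up those NAs afterwards, assuming the test.csv file defined in the answer below and that a padded row should simply be dropped:

# Read the file, padding short rows with NA so read.table does not fail
raw <- read.table("test.csv", sep = ";", fill = TRUE)

# Keep only the rows that had the full number of fields (no NAs)
clean <- raw[complete.cases(raw), ]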

answered Nov 14 '22 by Paolo

What you could do is iterate over the lines in the file, and only add the lines that have the correct length.

I defined the following test csv file:

1;2;3;4
1;2;3;4
1;2;3
1;2;3;4

Using read.table fails:

> read.table("test.csv", sep = ";")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :                                                                       
  line 3 did not have 4 elements   

Now an iterative approach:

require(plyr)
no_lines = 4
correct_length = 4
file_con = file("test.csv", "r")
result = ldply(1:no_lines, function(line) {
   dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
   if(length(dum) == correct_length) {
     return(dum)
   } else {
     cat(sprintf("Skipped line %s\n", line))
     return(NULL)
   }
  })
close(file_con)

> result
  V1 V2 V3 V4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4

Of course this is a trivial example, as the file is really small. Let us create a more challenging example to act as a benchmark.

# First file with invalid rows
norow = 10e5    # number of rows
no_lines = round(runif(norow, min = 3, max = 4))
no_lines[1] = correct_length
file_content = ldply(no_lines, function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_test.csv")

# Same length with valid rows
file_content = ldply(rep(4, norow), function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_normal.csv")

Now for the benchmark

# Iterative approach
system.time({file_con <- file("big_test.csv", "r")
    result_test <- ldply(1:norow, function(line) {
       dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
       if(length(dum) == correct_length) {
         return(dum)
       } else {
         # Commenting this speeds up by 30%
         #cat(sprintf("Skipped line %s\n", line))
         return(NULL)
       }
      })
    close(file_con)})
   user  system elapsed 
 20.559   0.047  20.775

# Normal read.table
system.time(result_normal <- read.table("big_normal.csv", sep = ";"))
   user  system elapsed 
  1.060   0.015   1.079 

# read.table with fill = TRUE
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE)
             complete_rows <- complete.cases(result_fill)
             result_fill <- result_fill[complete_rows,]})
   user  system elapsed 
  1.161   0.033   1.203 

# Specifying which type the columns are (e.g. character or numeric)
# using the colClasses argument.
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE, 
                                       colClasses = rep("numeric", 4))
             complete_rows <- complete.cases(result_fill)
             result_fill <- result_fill[complete_rows,]})
   user  system elapsed 
  0.933   0.064   1.001

So the iterative approach is quite a bit slower, but 20 seconds for 1 million rows might be acceptable (although this depends on your definition of acceptable), especially when you only have to do this once and then save the result with save for later retrieval. The solution suggested by @Paolo is almost as fast as the normal call to read.table. Rows that contain the wrong number of columns (and thus NAs) are eliminated using complete.cases. Specifying the classes of the columns further improves performance, and I think this effect will get bigger as the number of columns and rows grows.

So in conclusion, the best option is to use read.table with fill = TRUE while specifying the classes of the columns. The iterative approach using ldply is only a good option if you want more flexibility in how the lines are read, e.g. only keeping a line if a certain value is above a threshold (roughly sketched below). But that could probably be done more quickly by reading all the data into R and then creating a subset. Only when the data is bigger than your RAM could I imagine the iterative approach having its merits.
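
For illustration, a rough sketch of that kind of per-line filtering, reusing norow, correct_length and ldply from the benchmark above; the threshold and the choice of the first column are purely hypothetical:

threshold = 0   # hypothetical cutoff, not part of the benchmark above
file_con = file("big_test.csv", "r")
filtered = ldply(1:norow, function(line) {
   dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
   # Keep the line only if it has the right length and its first field exceeds the threshold
   if(length(dum) == correct_length && as.numeric(dum[1]) > threshold) {
     return(dum)
   } else {
     return(NULL)
   }
  })
close(file_con)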

answered Nov 14 '22 by Paul Hiemstra