How to skip invalid rows when reading data frame from file in R?

Tags: dataframe, r

I have a large file which contains lots of data, and I'd like to read it into a data frame, but it contains some invalid rows that cause read.table to fail. I tried the following to skip the invalid lines, but the performance is very poor.

counts <- count.fields(textConnection(lines), sep = "\001")
raw_data <- read.table(textConnection(lines[counts == 34]), sep = "\001")

Is there any better way to achieve this? Thanks

asked May 15 '12 by zjffdu

2 Answers

Using @PaulHiemstra's sample data:

read.table("test.csv", sep = ";", fill=TRUE)

Then you can take care of the NAs as you wish.
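
For example, a minimal sketch of cleaning up those NAs afterwards, assuming the test.csv file defined in the answer below and that a padded row should simply be dropped:

# Read the file, padding short rows with NA so read.table does not fail
raw <- read.table("test.csv", sep = ";", fill = TRUE)

# Keep only the rows that had the full number of fields (no NAs)
clean <- raw[complete.cases(raw), ]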

answered Nov 14 '22 by Paolo

What you could do is iterate over the lines in the file, and only add the lines that have the correct length.

I defined the following test csv file:

1;2;3;4
1;2;3;4
1;2;3
1;2;3;4

Using read.table fails:

> read.table("test.csv", sep = ";")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :                                                                       
  line 3 did not have 4 elements   

Now an iterative approach:

require(plyr)
no_lines = 4
correct_length = 4
file_con = file("test.csv", "r")
result = ldply(1:no_lines, function(line) {
   dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
   if(length(dum) == correct_length) {
     return(dum)
   } else {
     cat(sprintf("Skipped line %s\n", line))
     return(NULL)
   }
  })
close(file_con)

> result
  V1 V2 V3 V4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4

Of course this is a trivial example, as the file is really small. Let us create a more challenging example to act as a benchmark.

# First file with invalid rows
norow = 10e5    # number of rows
no_lines = round(runif(norow, min = 3, max = 4))
no_lines[1] = correct_length
file_content = ldply(no_lines, function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_test.csv")

# Same length with valid rows
file_content = ldply(rep(4, norow), function(line) paste(1:line, collapse = ";"))
writeLines(paste(file_content[[1]], sep = "\n"), "big_normal.csv")

Now for the benchmark

# Iterative approach
system.time({file_con <- file("big_test.csv", "r")
    result_test <- ldply(1:norow, function(line) {
       dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
       if(length(dum) == correct_length) {
         return(dum)
       } else {
         # Commenting this speeds up by 30%
         #cat(sprintf("Skipped line %s\n", line))
         return(NULL)
       }
      })
    close(file_con)})
   user  system elapsed 
 20.559   0.047  20.775

# Normal read.table
system.time(result_normal <- read.table("big_normal.csv", sep = ";"))
   user  system elapsed 
  1.060   0.015   1.079 

# read.table with fill = TRUE
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE)
             complete_rows <- complete.cases(result_fill)
             result_fill <- result_fill[complete_rows,]})
   user  system elapsed 
  1.161   0.033   1.203 

# Specifying which type the columns are (e.g. character or numeric)
# using the colClasses argument.
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE, 
                                       colClasses = rep("numeric", 4))
             complete_rows <- complete.cases(result_fill)
             result_fill <- result_fill[complete_rows,]})
   user  system elapsed 
  0.933   0.064   1.001

So the iterative approach is quite a bit slower, but 20 seconds for 1 million rows might be acceptable (although this depends on your definition of acceptable), especially when you only have to do this once and then save the result with save for later retrieval. The solution suggested by @Paolo is almost as fast as the normal call to read.table. Rows that contain the wrong number of columns (and thus NAs) are eliminated using complete.cases. Specifying the classes of the columns further improves performance, and I think this effect will get bigger as the number of columns and rows grows.

So in conclusion, the best option is to use read.table with fill = TRUE while specifying the classes of the columns. The iterative approach using ldply is only a good option if you want more flexibility in how the lines are read, e.g. only keeping a line if a certain value is above a threshold (roughly sketched below). But that could probably be done more quickly by reading all the data into R and then creating a subset. Only when the data is bigger than your RAM could I imagine the iterative approach having its merits.
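
For illustration, a rough sketch of that kind of per-line filtering, reusing norow, correct_length and ldply from the benchmark above; the threshold and the choice of the first column are purely hypothetical:

threshold = 0   # hypothetical cutoff, not part of the benchmark above
file_con = file("big_test.csv", "r")
filtered = ldply(1:norow, function(line) {
   dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
   # Keep the line only if it has the right length and its first field exceeds the threshold
   if(length(dum) == correct_length && as.numeric(dum[1]) > threshold) {
     return(dum)
   } else {
     return(NULL)
   }
  })
close(file_con)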

answered Nov 14 '22 by Paul Hiemstra