Is it possible to direct fread within the data.table package to skip erroneous rows?

data.table 1.9.2

I'm reading in a large table and there appears to be at least one row which produces an error of the following nature:

Error in fread(paste(base_dir, filename, sep = "")) : 
Expected sep ('|') but '' ends field 23 on line 190333 when reading data:...

Is it possible to direct fread in the data.table package to skip erroneous rows?

Or is there any other way I can work around this sort of error in the future?

asked May 12 '14 04:05

bibzzzz

2 Answers

One workaround if you wish to skip erroneous rows:

First, read the file as a single column of whole lines by passing sep="\n". Then count the separators in each line, keep only the lines with the expected number of separators, collapse those lines back into one string, and read that string with the true column separator. See the example below.
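Before filtering, it can help to check which lines are malformed. The following is a minimal sketch using base R's count.fields(); the temporary file and its contents are made up for illustration, and quote = "" assumes the data contains no quoted fields.

```r
# Hypothetical sample file: line 4 is missing one "|" separator.
tmp <- tempfile(fileext = ".txt")
writeLines(c("var1|var2|var3|var4",
             "a|1|10|TRUE",
             "b|2|10|FALSE",
             "c|3|10FALSE",
             "d|4|10|TRUE"), tmp)

# Number of fields on each line; quoting and comment characters are disabled
# so that every character is treated literally.
n_fields <- count.fields(tmp, sep = "|", quote = "", comment.char = "")

# Take the header line as the reference and report the lines that differ.
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

The line numbers in bad_lines count the header as line 1, so they can be compared directly with the line number in the fread error message.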

Example data:

require(data.table)

# Note: the row starting with "c" is missing the separator between 10 and FALSE.
wrong <- fread("
var1|var2|var3|var4
a|1|10|TRUE
b|2|10|FALSE
c|3|10FALSE
d|4|10|TRUE
e|5|10|TRUE",sep="\n")

Count the number of separators:

There are a number of ways to do this; see stringr's ?str_count for one:

library(stringr)
wrong[, n_seps := str_count(wrong[[1]], fixed("|"))] # see below for explanation.
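If adding a stringr dependency is undesirable, the same count can be obtained with base R. This is a sketch on a made-up character vector x; comparing string lengths before and after deleting the separator only works when the separator is a single character.

```r
# Made-up input: three, two and zero separators respectively.
x <- c("a|1|10|TRUE", "c|3|10FALSE", "no separators")

# Separators per element = characters lost when every "|" is deleted.
n_seps <- nchar(x) - nchar(gsub("|", "", x, fixed = TRUE))
n_seps
```

Applied to the example table, x would be wrong[[1]].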

Or, with some simplifying assumptions, via an Rcpp analogue:

If the separator is a single character (which it usually is), I have found the simple function below to be the most efficient. It is written in C++ and exported to R via the Rcpp package's sourceCpp() workhorse.

In a separate "helpers.cpp" file:

    #include <Rcpp.h>
    #include <algorithm>
    #include <string>

    using namespace Rcpp;
    using namespace std;

    // Count how many times the character y occurs in each element of x.
    // [[Rcpp::export]]
    NumericVector v_str_count_cpp(CharacterVector x, char y) {
        int n = x.size();
        NumericVector out(n);

        for(int i = 0; i < n; ++i) {
            out[i] = std::count(x[i].begin(), x[i].end(), y);
        }
        return out;
    }

New column with counts:

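We then apply the function to count the number of occurrences of | in each row and store the results in a new column called n_seps.

# Compile helpers.cpp and load v_str_count_cpp() into the R session.
Rcpp::sourceCpp("helpers.cpp")
# The function already loops over the whole vector, so apply() is not needed.
wrong[, n_seps := v_str_count_cpp(wrong[[1]], "|")]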

Now wrong looks like:

> wrong
   var1|var2|var3|var4 n_seps
1:         a|1|10|TRUE      3
2:        b|2|10|FALSE      3
3:         c|3|10FALSE      2
4:         d|4|10|TRUE      3
5:         e|5|10|TRUE      3

Now filter for the well-formed rows and collapse them back into a single string:

collapsed <- paste0( wrong[n_seps == 3][[1]], collapse = "\n" )

Lastly, read it back with the proper separator:

correct <- fread(collapsed,sep="|")

which looks like:

> correct
   V1 V2 V3    V4
1:  a  1 10  TRUE
2:  b  2 10 FALSE
3:  d  4 10  TRUE
4:  e  5 10  TRUE
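The steps above can be wrapped into a small helper. This is a sketch rather than a tested utility: the name read_skipping_bad_rows is hypothetical, and it assumes a single-character separator, a well-formed header on the first line, and a data.table version whose fread() accepts the text argument (1.11.0 or later, so not the 1.9.2 mentioned in the question).

```r
library(data.table)

# Hypothetical helper: keep only the lines that have as many separators as
# the header line, then parse the surviving lines with fread().
read_skipping_bad_rows <- function(path, sep = "|") {
  lines  <- readLines(path)
  n_seps <- nchar(lines) - nchar(gsub(sep, "", lines, fixed = TRUE))
  good   <- n_seps == n_seps[1]  # the header defines the expected count
  if (any(!good)) {
    message("Skipping line(s): ", paste(which(!good), collapse = ", "))
  }
  fread(text = lines[good], sep = sep)
}

# Usage on a made-up file whose fourth line lacks a separator.
tmp <- tempfile(fileext = ".txt")
writeLines(c("var1|var2|var3|var4",
             "a|1|10|TRUE",
             "b|2|10|FALSE",
             "c|3|10FALSE",
             "d|4|10|TRUE",
             "e|5|10|TRUE"), tmp)
correct <- read_skipping_bad_rows(tmp)
```

Because the header line is kept among the good lines, the original column names survive instead of the default V1 to V4. Note that readLines() holds the whole file in memory as character strings, which may matter for very large files.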

Hope this helps.

answered Nov 12 '22 04:11

npjc

No. There is no option that makes fread do that.

There is a discussion on GitHub about it, but it does not identify any option that makes fread skip such lines: https://github.com/Rdatatable/data.table/issues/810

answered Nov 12 '22 03:11

userJT
