data.table 1.9.2
I'm reading in a large table and there appears to be at least one row which produces an error of the following nature:
Error in fread(paste(base_dir, filename, sep = "")) :
Expected sep ('|') but '' ends field 23 on line 190333 when reading data:...
Is it possible to direct fread in the data.table package to skip erroneous rows?
Or is there any other way I can work around this sort of error in the future?
One workaround, if you wish to skip erroneous rows:
First read the file in as a single column, splitting only on new lines, by using sep="\n".
Then count the number of separators in each row, keep only the rows with the correct number of separators, collapse
the data back into a single string, and read it again using the true column separator. See the example below.
require(data.table)
# Note the missing separator between 10 and FALSE in the row starting with "c".
wrong <- fread("
var1|var2|var3|var4
a|1|10|TRUE
b|2|10|FALSE
c|3|10FALSE
d|4|10|TRUE
e|5|10|TRUE", sep = "\n")
There are a number of ways to count the separators; see stringr's ?str_count for one:
library(stringr)
wrong[, n_seps := str_count(wrong[[1]], fixed("|"))] # see below for explanation
Or, with some simplifying assumptions, via an Rcpp analogue:
If the separator is a single character (which it usually is), I have found the simple function below to be the most efficient. It is written in C++ and exported to R via the Rcpp package's sourceCpp() workhorse.
#include <Rcpp.h>
#include <algorithm>
#include <string>
using namespace Rcpp;
using namespace std;

// Count how many times the single character y occurs in each element of x.
// [[Rcpp::export]]
NumericVector v_str_count_cpp(CharacterVector x, char y) {
  int n = x.size();
  NumericVector out(n);
  for(int i = 0; i < n; ++i) {
    out[i] = std::count(x[i].begin(), x[i].end(), y);
  }
  return out;
}
We then apply the function to count the number of occurrences of | in each row and store the results in a new column called n_seps.
# The function already loops over its input vector, so no apply() is needed.
wrong[, n_seps := v_str_count_cpp(wrong[[1]], "|")]
Now wrong looks like:
> wrong
   var1|var2|var3|var4 n_seps
1:         a|1|10|TRUE      3
2:        b|2|10|FALSE      3
3:         c|3|10FALSE      2
4:         d|4|10|TRUE      3
5:         e|5|10|TRUE      3
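# Keep only the rows with the expected 3 separators and collapse them into one string.
collapsed <- paste0(wrong[n_seps == 3][[1]], collapse = "\n")
# Read the cleaned text again, this time with the true column separator.
correct <- fread(collapsed, sep = "|")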
which looks like:
> correct
   V1 V2 V3    V4
1:  a  1 10  TRUE
2:  b  2 10 FALSE
3:  d  4 10  TRUE
4:  e  5 10  TRUE
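The steps above can also be wrapped into a small helper that works directly on a file. This is only a sketch under a few assumptions: the function name read_skipping_bad_rows is made up for illustration, the first line of the file is a well-formed header, no field contains the separator inside quotes, and the installed data.table is recent enough for fread to accept the text= argument. It counts separators with base R instead of stringr or Rcpp.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read a delimited file and
# drop every line whose separator count differs from the header line's.
read_skipping_bad_rows <- function(path, sep = "|") {
  lines <- readLines(path)
  # Count literal occurrences of sep per line; fixed = TRUE keeps "|"
  # from being interpreted as a regular expression.
  n_seps <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  ok <- n_seps == n_seps[1]
  if (any(!ok)) {
    message("Dropping line(s): ", paste(which(!ok), collapse = ", "))
  }
  # Assumes a data.table version in which fread() has a text= argument.
  fread(text = lines[ok], sep = sep)
}

# Usage on the toy data from the example above, written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("var1|var2|var3|var4",
             "a|1|10|TRUE",
             "b|2|10|FALSE",
             "c|3|10FALSE",
             "d|4|10|TRUE",
             "e|5|10|TRUE"), tmp)
res <- read_skipping_bad_rows(tmp)
```

Because the header line is kept, the result here carries the original column names rather than V1 to V4. The message listing the dropped line numbers makes it easy to go back and inspect the bad rows instead of discarding them silently.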
Hope this helps.
No. There is no option that makes fread do that.
There is a discussion about it on GitHub (https://github.com/Rdatatable/data.table/issues/810), but it does not say which option should be used to make fread skip those lines.