
How to count the number of non-empty fields in a delimited file?

Tags: import, r, csv
You can count the number of fields per line in a comma/tab/whatever delimited text file using utils::count.fields.

Here's a reproducible example:

d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)

fname <- "test.csv"
write.csv(d, fname, na = "",  row.names = FALSE)
count.fields(fname, sep = ",")
## [1] 3 3 3 3 3 3

I want to calculate the number of non-empty fields per line. I can do this in a clunky way by reading in everything and counting the number of values that aren't NA.

d2 <- read.csv(fname, na.strings = "")
rowSums(!is.na(d2))
## [1] 1 2 3 0 2

I'd really like a way of scanning the file (like count.fields) so I can target specific sections to read in.

Is there a better way of counting the number of non-empty fields in a delimited file?
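To make the goal concrete, here is a minimal sketch of the "target specific sections" step, assuming the line numbers of interest are already known from some per-line count. The helper name `read_line_range` and its arguments are hypothetical, and the sketch assumes one record per physical line (no embedded newlines or blank lines) with a header on line 1.

```r
# Hypothetical helper: read only file lines `first` to `last` (1-based,
# counting the header as line 1) without parsing the rest of the file.
read_line_range <- function(path, first, last) {
  # Column names come from the first line of the file.
  col_names <- scan(path, what = "", sep = ",", nlines = 1L, quiet = TRUE)
  read.csv(path,
           header = FALSE,
           skip = first - 1L,            # lines to skip before reading
           nrows = last - first + 1L,    # number of lines to read
           col.names = col_names,
           na.strings = "",
           stringsAsFactors = FALSE)
}

# Recreate the example file from above in a temporary location.
d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)
fname <- tempfile(fileext = ".csv")
write.csv(d, fname, na = "", row.names = FALSE)

# File lines 3 and 4 hold data rows 2 and 3, the two fullest rows.
sec <- read_line_range(fname, first = 3L, last = 4L)
```

The missing piece is a cheap way to find `first` and `last`, which is what the question is asking for.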

Richie Cotton asked Sep 20 '15 07:09


1 Answer

This should be completely portable provided you have the Rcpp, inline & BH packages installed (BH supplies the Boost headers):

library(Rcpp)
library(inline)

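csvblanks <- '
// Open the file; return NULL to R if it cannot be opened.
string data = as<string>(filename);
ifstream fil(data.c_str());
if (!fil.is_open()) return(R_NilValue);

// Boost tokenizer with the default escaped_list_separator:
// comma separator, double-quote quoting, backslash escapes.
typedef tokenizer< escaped_list_separator<char> > Tokenizer;

vector<int> retval;  // one blank-field count per line
string line;

// Read one line at a time so the whole file is never held in memory.
while (getline(fil, line)) {
  int numblanks = 0;
  Tokenizer tok(line);
  // A zero-length token is an empty field.
  for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
    numblanks += (beg->length() == 0) ? 1 : 0;
  }
  retval.push_back(numblanks);
}
return(wrap(retval));
'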

count_blanks <- rcpp(
  signature(filename="character"),
  body=csvblanks,
  includes=c("#include <iostream>",
             "#include <fstream>",
             "#include <vector>",
             "#include <string>",
             "#include <algorithm>",
             "#include <iterator>",
             "#include <boost/tokenizer.hpp>",
             "using namespace Rcpp;",
             "using namespace std;",
             "using namespace boost;")
)

Once that's sourced you can call count_blanks(FULLPATH) and it will return an integer vector with the number of blank fields on each line (or NULL if the file can't be opened).

I ran it against this file:

"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5

via:

count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0
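Since the question asks for non-empty rather than blank fields, one way to get there is to subtract the blank counts from the per-line totals that count.fields reports. The sketch below is self-contained: it rewrites the sample file to a temporary path and, so that it runs without compiling anything, copies the blank counts from the output above instead of calling count_blanks. It assumes both parsers split each line into the same fields, which holds for this simple file but may not for files with unusual quoting.

```r
# Write the sample file shown above to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"',
  "1,2,3,4,5", "1,,3,4,5", "1,2,3,4,5", "1,2,,4,5", "1,2,3,4,5",
  "1,2,3,,5", "1,2,3,4,5", "1,2,3,4,", "1,2,3,4,5", "1,,3,,5",
  "1,2,3,4,5", ",2,,4,", "1,2,3,4,5"
), path)

# Total fields per line, using base R.
total <- utils::count.fields(path, sep = ",")

# Blank-field counts copied from the count_blanks() output above;
# with the compiled function available this would be count_blanks(path).
blanks <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 0, 3, 0)

# Non-empty fields per line = total fields - blank fields.
nonempty <- total - blanks
```

The first element of `nonempty` belongs to the header line, so drop it if only data rows matter.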

CAVEATS

  • It's fairly obvious that it's not ignoring the header, so it could use a header logical parameter with associated C/C++ code (which will be pretty straightforward).
  • If you're counting "spaces" (i.e. [:space:]+) as "empty" you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
  • It's using the default configuration for the Boost function escaped_list_separator which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv/read.table).

This will more closely approach count.fields/C_countfields performance and avoids the memory cost of reading in every line just to find the lines you eventually want to target. I don't think preallocating space for the returned vector will add much to the speed, but you can see the discussion here which shows how to do so if need be.
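For comparison, the header and whitespace caveats listed above can be sketched in plain R. This is not the compiled approach and will be slower; it reads the file in chunks so memory stays bounded, and it counts non-empty fields directly. The function name `count_nonempty_fields` and its `header`, `trim` and `chunk_size` arguments are hypothetical. It also assumes the separator never appears inside a quoted field, since it splits on the separator without parsing quotes.

```r
# Hypothetical pure-R counter of non-empty fields per line.
# Assumes the separator never occurs inside quoted fields.
count_nonempty_fields <- function(path, sep = ",", header = TRUE,
                                  trim = TRUE, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  if (header) invisible(readLines(con, n = 1L))  # discard the header line

  counts <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk_size)      # next chunk of lines
    if (length(lines) == 0L) break
    fields <- strsplit(lines, sep, fixed = TRUE)
    chunk_counts <- vapply(fields, function(f) {
      f <- gsub('"', "", f, fixed = TRUE)        # drop surrounding quotes
      if (trim) f <- trimws(f)                   # whitespace-only is empty
      sum(nzchar(f))                             # count non-empty fields
    }, integer(1))
    counts <- c(counts, chunk_counts)
  }
  counts
}

# Usage on the data frame from the question.
d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)
fname <- tempfile(fileext = ".csv")
write.csv(d, fname, na = "", row.names = FALSE)
res <- count_nonempty_fields(fname)
```

On this file `res` matches the 1 2 3 0 2 that the read.csv approach in the question produces. strsplit drops a trailing empty field, which is harmless here because only non-empty fields are counted.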

hrbrmstr answered Sep 29 '22 11:09