You can count the number of fields per line in a comma/tab/whatever delimited text file using utils::count.fields.
Here's a reproducible example:
d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)
fname <- "test.csv"
write.csv(d, fname, na = "", row.names = FALSE)
count.fields(fname, sep = ",")
## [1] 3 3 3 3 3 3
I want to calculate the number of non-empty fields per line. I can do this in a clunky way by reading in everything and counting the number of values that aren't NA.
d2 <- read.csv(fname, na.strings = "")
rowSums(!is.na(d2))
## [1] 1 2 3 0 2
I'd really like a way of scanning the file (like count.fields) so I can target specific sections to read in.
Is there a better way of counting the number of non-empty fields in a delimited file?
This should be completely portable provided you have the Rcpp, inline & BH packages installed:
library(Rcpp)
library(inline)
csvblanks <- '
  // Open the file; return NULL to R if it cannot be opened.
  string data = as<string>(filename);
  ifstream fil(data.c_str());
  if (!fil.is_open()) return(R_NilValue);

  typedef tokenizer< escaped_list_separator<char> > Tokenizer;
  vector<int> retval;
  string line;

  // Tokenize each line and count its zero-length fields.
  while (getline(fil, line)) {
    int numblanks = 0;
    Tokenizer tok(line);
    for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      numblanks += (beg->length() == 0) ? 1 : 0;
    }
    retval.push_back(numblanks);
  }

  return(wrap(retval));
'
count_blanks <- rcpp(
  signature(filename = "character"),
  body = csvblanks,
  includes = c("#include <iostream>",
               "#include <fstream>",
               "#include <vector>",
               "#include <string>",
               "#include <algorithm>",
               "#include <iterator>",
               "#include <boost/tokenizer.hpp>",
               "using namespace Rcpp;",
               "using namespace std;",
               "using namespace boost;")
)
Once that's sourced you can call count_blanks(FULLPATH) and it will return a numeric vector of counts of blank fields per line.
I ran it against this file:
"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5
via:
count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0
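Since the question asks for non-empty fields rather than blank ones, the two counts can be combined: subtract the blank count from the total that count.fields reports for the same line. The sketch below recreates the sample file in a temporary location and, so that it runs without compiling anything, copies the blank counts from the output above instead of calling count_blanks; the names tmp, total, blanks and nonempty are only illustrative.

```r
# Recreate the sample file from above in a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"',
  "1,2,3,4,5", "1,,3,4,5", "1,2,3,4,5", "1,2,,4,5", "1,2,3,4,5",
  "1,2,3,,5", "1,2,3,4,5", "1,2,3,4,", "1,2,3,4,5", "1,,3,,5",
  "1,2,3,4,5", ",2,,4,", "1,2,3,4,5"
), tmp)

# Total fields per line, as reported by count.fields().
total <- count.fields(tmp, sep = ",")

# Blank counts copied from the count_blanks() output shown above;
# with the compiled function available this would be count_blanks(tmp).
blanks <- c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 2L, 0L, 3L, 0L)

# Non-empty fields per line.
nonempty <- total - blanks
nonempty
```

Something like which(nonempty >= 4) would then give the line numbers worth targeting with the skip and nrows arguments of read.csv.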
CAVEATS
- The header line is counted like any other line. Skipping it would take a header logical parameter with associated C/C++ code (which will be pretty straightforward).
- If you want to treat whitespace-only fields (i.e. [:space:]+) as "empty" you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
- The tokenizer uses the default configuration of escaped_list_separator, which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv/read.table).
This will more closely approach count.fields/C_countfields performance and will eliminate the need to consume memory by reading in every line just to find the lines you eventually want to more optimally target. I don't think preallocating space for the returned numeric vector will add much to the speed, but you can see the discussion here which shows how to do so if need be.
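For comparison, here is a rough pure-R sketch of the two caveats about the header and [:space:]+ fields. It is not the Rcpp approach above and will likely be slower. The function name count_nonempty and its header and chunk arguments are hypothetical, made up for this illustration. It assumes no quoted field contains the separator, and a quoted empty string ("") still counts as non-empty. It reads the file in chunks through a connection, so only chunk lines are held in memory at a time.

```r
# Count non-empty fields per line without building a data frame.
# Assumption: fields never contain the separator inside quotes.
count_nonempty <- function(path, sep = ",", header = FALSE, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Optionally consume the header line so it is not counted.
  if (header) invisible(readLines(con, n = 1L))

  out <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    fields <- strsplit(lines, sep, fixed = TRUE)
    # A field is "empty" if it has zero length or is whitespace only.
    counts <- vapply(
      fields,
      function(f) sum(!grepl("^[[:space:]]*$", f)),
      integer(1)
    )
    out <- c(out, counts)
  }
  out
}

# Small demonstration file: a header, then lines with blank and
# whitespace-only fields.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,,", ",b,beta", "3, ,gamma", ",,"), demo_file)
count_nonempty(demo_file, header = TRUE)
```

Growing out with c() is the simple choice here; for very large files the results could be collected in a list and combined once at the end.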