I have a large vector (100M elements) of words of the type:
words <- paste(letters, letters, letters, letters, sep = "_")
(In the actual data the words are not all the same, but all are of length 8.)
I would like to convert them to a data frame with one column per letter of the word and one row per word. For this I have tried str_split_fixed and rbind on the result, but on the large vector R freezes/takes forever.
So the desired output is of the form:
l1 l2 l3 l4
1 a a a a
2 b b b b
3 c c c c
Is there a faster way of doing this?
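For reference, here is a minimal sketch of the slow baseline described above (strsplit on "_" plus rbind) on a toy input; this is the pattern that becomes unusable at 100M elements:

```r
# Toy input: four underscore-separated words ("a_a_a_a", "b_b_b_b", ...)
words <- paste(letters[1:4], letters[1:4], letters[1:4], letters[1:4], sep = "_")

# Split each word on "_" and rbind the pieces: one row per word,
# one column per letter. rbind over millions of elements is the bottleneck.
df <- as.data.frame(do.call(rbind, strsplit(words, "_", fixed = TRUE)),
                    stringsAsFactors = FALSE)
names(df) <- paste0("l", seq_len(ncol(df)))
df
```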
Use paste() to collapse the vector elements into a single newline-separated string, then fread() to parse that string into a data.table/data.frame. As a function:
collapse2fread <- function(x, sep) {
  require(data.table)
  # Collapse to one big string, then let fread()'s fast C parser split it
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
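A quick usage sketch (assuming data.table is installed); the V1..V4 column names are fread()'s defaults when header = FALSE:

```r
library(data.table)

collapse2fread <- function(x, sep) {
  # Collapse to one newline-separated string and parse it with fread()
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}

words <- paste(letters[1:3], letters[1:3], letters[1:3], letters[1:3], sep = "_")
dt <- collapse2fread(words, sep = "_")
dt
```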
You could also try doing it in C++ via the Rcpp package to get a bit more out of it. Something like:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  for (int i = 0; i < n; i++) {
    collapsed += std::string(subject[i]) + collapseBy;
  }
  return collapsed;
}
After compiling it with Rcpp::sourceCpp(), we get:
collapse_cpp2fread <- function(x, sep) {
  require(data.table)
  # collapse_cpp() takes positional arguments: the vector, then the separator
  fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
library(microbenchmark)
microbenchmark(
  paste0(words, collapse = "\n"),
  collapse_cpp(words, "\n"),
  times = 100)
Not much, but it's something:
> Unit: microseconds
> expr min lq median uq max neval
> paste0(words, collapse = "\\n") 7.297 7.7695 8.162 8.4255 33.824 100
> collapse_cpp(words, "\\n") 4.477 5.0095 5.117 5.3525 17.052 100
Make a more realistically sized input:
words <- rep(paste0(letters[1:8], collapse = '_'), 1e5) # 100K elements
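If you want the benchmark input to vary word-by-word (closer to the actual data) rather than repeat one word, a sketch that samples a letter for each position:

```r
set.seed(1)
n <- 1e5
# One random letter per slot, four slots, joined with "_"
words <- do.call(paste,
                 c(lapply(1:4, function(i) sample(letters, n, replace = TRUE)),
                   sep = "_"))
head(words)
```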
benchmark:
microbenchmark(
  do.call(rbind, strsplit(words, '_')),
  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE),
  fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE),
  times = 10)
gives:
> Unit: milliseconds
>                                                               expr       min        lq    median        uq      max neval
>                               do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
>  fread(paste0(words, collapse = "\\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>       fread(collapse_cpp(words, "\\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10
So about a 16x improvement over strsplit/rbind at this size (median 823 ms vs. 50 ms). Hope it helps!