large string vector to data.frame

I have a large vector (100M elements) of words of the form:

words <- paste(letters,letters,letters,letters,sep="_")

(In the actual data the words are not all the same, but they are all of length 8.)

I would like to convert them to a data frame with a column for each letter of the word and a row for each word. For this I have tried str_split_fixed and rbind on the result, but on the large vector R freezes/takes forever.
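
For reference, the attempt looked roughly like this (a sketch; str_split_fixed() comes from the stringr package):

library(stringr)

# split each word on "_" into 4 pieces; returns a character matrix
mat <- str_split_fixed(words, "_", 4)
df  <- as.data.frame(mat)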

So the desired output is of the form:

      l1    l2    l3    l4
1     a     a     a     a  
2     b     b     b     b
3     c     c     c     c

Is there a faster way of doing this?

asked Dec 13 '25 by sophia

1 Answer

Solution:

  • use paste() to collapse the vector elements into a single newline-separated string
  • use fread() to parse that string into a data.table/data.frame

As a function:

collapse2fread <- function(x, sep) {
    require(data.table)
    # collapse the vector into one big newline-separated string,
    # then let fread() parse it as if it were a file
    fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
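
For example, on the toy words vector from the question (since header = FALSE, fread() assigns default column names V1..V4):

collapse2fread(words, sep = "_")
#     V1 V2 V3 V4
#  1:  a  a  a  a
#  2:  b  b  b  b
#  3:  c  c  c  c
# ... (26 rows in total)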

Rcpp on top of that?

You could also try doing the collapsing in C++ via the Rcpp package to get a bit more out of it. Something like:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
    int n = subject.size();
    std::string collapsed;

    // append each element plus the separator onto one big string
    for (int i = 0; i < n; i++) {
        collapsed += std::string(subject[i]) + collapseBy;
    }
    return collapsed;
}
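
Saved to its own file (say collapse.cpp; the filename here is just for illustration), it can be compiled and loaded with Rcpp::sourceCpp():

library(Rcpp)
sourceCpp("collapse.cpp")  # compiles and makes collapse_cpp() callable from R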

Then we get:

collapse_cpp2fread <- function(x, sep) {
    require(data.table)
    # collapse_cpp() takes the separator as its second (positional) argument
    fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
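
It's called the same way as the paste0() version, e.g.:

collapse_cpp2fread(words, sep = "_")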

A quick test of the C++ function:

library(microbenchmark)

microbenchmark(
    paste0(words, collapse = "\n"),
    collapse_cpp(words, "\n"),
    times = 100)

Not much, but it's something:

> Unit: microseconds
>                             expr   min     lq median     uq    max neval
>  paste0(words, collapse = "\\n") 7.297 7.7695  8.162 8.4255 33.824   100
>       collapse_cpp(words, "\\n") 4.477 5.0095  5.117 5.3525 17.052   100

Comparison to strsplit method:

First, make a more realistic input:

words <- rep(paste0(letters[1:8], collapse = '_'), 1e5) # 100K elements

Then benchmark:

microbenchmark(
    do.call(rbind, strsplit(words, '_')),
    fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE),
    fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE),
    times = 10)

gives:

> Unit: milliseconds
>                                                                expr       min        lq    median        uq      max neval
>                                do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
>   fread(paste0(words, collapse = "\\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>        fread(collapse_cpp(words, "\\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10

So roughly a 16x improvement over the strsplit approach at this size (median 823 ms vs 50 ms). Hope it helps!

answered Dec 15 '25 by npjc

