
String decomposition

I need to decompose about 75 million character strings using R. I need to do something like creating a Term Document matrix, where each word that occurs in the document becomes a column in the matrix and anywhere the term occurs, the matrix element is coded as 1.

I have: About 75 million character strings ranging in length from about 0-100 characters; they represent a time series giving coded information about what happened in that period. Each code is exactly one character and corresponds to a time period.

I need: Some kind of matrix or way of conveying the information that takes away the time series and just tells me how many times a certain code was reported in each series.

For instance: The string "ABCDEFG-123" would become a row in the matrix where each character is tallied as occurring once. If this is too difficult, a matrix of 0s and 1s would also give me some information, though I would prefer to keep as much information as possible.

Does anyone have any ideas of how to do this quickly? There are 20 possible codes.

dc3 asked Feb 08 '23 15:02


2 Answers

Example:

my20chars = c(LETTERS[1:10], 0:9)

set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))

One approach:

library(data.table)

d = setDT(stack(strsplit(setNames(x,x),"")))
dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")

Result:

              ind 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
    1: ---8EEAD8I 0 0 0 0 0 0 0 0 2 0 1 0 0 1 2 0 0 0 1 0
    2: --33B6E-32 0 0 1 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
    3: --3IFBG8GI 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 2 0 2 0
    4: --4210I8H5 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
    5: --5H4DE9F- 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0
   ---                                                   
 9996: JJFJBJ24AJ 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 5
 9997: JJI-J-0FGB 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 3
 9998: JJJ1B54H63 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 3
 9999: JJJED7A3FI 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 3
10000: JJJIF6GI13 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 2 3
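
For intuition about what the reshape computes, the same tally can be sketched for a handful of strings in base R. This is a minimal illustration, not part of the original answer: `tally_codes` is a hypothetical helper name, and because it loops over strings it is not expected to scale to 75 million rows.

```r
# Hypothetical helper (not from the original answer): tally each code per string.
tally_codes <- function(x, codes) {
  rows <- lapply(strsplit(x, ""), function(chars) {
    # match() maps each character to its column position in `codes`;
    # characters that are not codes (such as "-") become NA, which
    # tabulate() ignores.
    tabulate(match(chars, codes), nbins = length(codes))
  })
  out <- do.call(rbind, rows)
  dimnames(out) <- list(x, codes)
  out
}

my20chars <- c(LETTERS[1:10], 0:9)
m <- tally_codes(c("ABCDEFG-123", "AAB-", ""), my20chars)
m[1, c("A", "G", "H", "1")]  # counts for a few codes in "ABCDEFG-123"
```

Unlike the `dcast` version, this keeps one row per input string even when strings repeat or are empty.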

Benchmark:

library(microbenchmark)

nstrs  = 1e5
nchars = 10
x = replicate(nstrs, paste0(sample(c(my20chars,"-"), nchars, replace=TRUE), collapse=""))

microbenchmark(
dcast = {
  d = setDT(stack(strsplit(setNames(x,x),"")))
  dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")
},
times = 10)

# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
#  dcast 3.112633 3.423935 3.480692 3.494176 3.573967 3.741931    10

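So this is probably too slow for the OP's 75 million strings: extrapolating linearly from a median of about 3.5 seconds per 100,000 ten-character strings gives roughly 45 minutes, and the OP's strings can be up to 100 characters long. It may still be a good place to start.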
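
Because there are only 20 codes, one alternative that may be worth benchmarking is to loop over the codes rather than over the strings, counting each code across the whole vector in one vectorised pass. The sketch below is an assumption-laden illustration, not the original answer's method: `count_codes` is a hypothetical name, it relies on every code being exactly one character (as the question states), and it has not been timed here.

```r
# Hypothetical helper (not from the original answer).
# For each code, count its occurrences in every string at once by measuring
# how much shorter each string becomes when that code is deleted.
count_codes <- function(x, codes) {
  counts <- vapply(
    codes,
    function(code) nchar(x) - nchar(gsub(code, "", x, fixed = TRUE)),
    integer(length(x))
  )
  # vapply() returns a plain vector when length(x) == 1, so restore the shape.
  matrix(counts, nrow = length(x), dimnames = list(NULL, codes))
}

my20chars <- c(LETTERS[1:10], 0:9)
x <- c("ABCDEFG-123", "JJFJBJ24AJ", "")
counts <- count_codes(x, my20chars)
counts[, c("A", "J", "2")]
```

The result is a dense integer matrix with one row per input string, so for 75 million strings it would likely need to be built in chunks.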

Frank answered Feb 16 '23 01:02


I really like @Frank's solution, but here's another way that has two advantages:

  • It uses a sparse matrix format, so you are more likely to fit everything into memory; and

  • It is (even) simpler.

It uses our quanteda package, where you tokenise the characters in each string and form a document-feature matrix from these in one command:

my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))

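require(quanteda)
# Note: written against an older quanteda release; newer versions renamed
# toLower to tolower and features() to featnames(), so adjust as needed.
myDfm <- dfm(x, what = "character", toLower = FALSE, verbose = FALSE)
# for equivalent printing, does not change content:
myDfm <- myDfm[, order(features(myDfm))]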
rownames(myDfm) <- x
head(myDfm)
# Document-feature matrix of: 6 documents, 20 features.
# 6 x 20 sparse Matrix of class "dfmSparse"
#             features
# docs         0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
#   FH29E8933B 0 0 1 2 0 0 0 0 1 2 0 1 0 0 1 1 0 1 0 0
#   ED4I605-H6 1 0 0 0 1 1 2 0 0 0 0 0 0 1 1 0 0 1 1 0
#   9E3CFIAI8H 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 1 2 0
#   020D746C5I 2 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0
#   736116A054 1 2 0 1 1 1 2 1 0 0 1 0 0 0 0 0 0 0 0 0
#   08JFBCG03I 2 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1

Disadvantage:

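  • It's (much) slower: almost nine times slower than the dcast approach in the benchmark below (median of about 21.3 s versus 2.4 s).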

Benchmark:

microbenchmark(
    dcast = {
        d = setDT(stack(strsplit(setNames(x,x),"")))
        dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")
    },
    quanteda = dfm(x, what = "character", toLower = FALSE, removePunct = FALSE, verbose = FALSE),
    times = 10)
# Unit: seconds
#      expr       min        lq      mean    median        uq       max neval
#     dcast  2.380971  2.423677  2.465338  2.429331  2.521256  2.636102    10
#  quanteda 21.106883 21.168145 21.369443 21.345173 21.519018 21.883966    10
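
The sparse-storage idea can also be sketched directly with the Matrix package, without going through a text-analysis toolkit. This is a hedged illustration rather than what quanteda does internally: `sparse_code_counts` is a hypothetical helper name, and its speed relative to the two approaches above has not been measured here.

```r
library(Matrix)  # a recommended package shipped with standard R installations

# Hypothetical helper (not part of quanteda or the original answers).
sparse_code_counts <- function(x, codes) {
  chars <- strsplit(x, "")
  i <- rep(seq_along(x), lengths(chars))  # row index: which string
  j <- match(unlist(chars), codes)        # column index: which code (NA if not a code)
  keep <- !is.na(j)                       # drop characters such as "-"
  # sparseMatrix() adds up the values of repeated (i, j) pairs,
  # so a vector of ones yields per-string counts.
  sparseMatrix(i = i[keep], j = j[keep], x = rep(1, sum(keep)),
               dims = c(length(x), length(codes)),
               dimnames = list(NULL, codes))
}

my20chars <- c(LETTERS[1:10], 0:9)
sm <- sparse_code_counts(c("ABCDEFG-123", "JJFJBJ24AJ", ""), my20chars)
sm[2, "J"]
```

Only the non-zero counts are stored, which matters when most strings use a small subset of the 20 codes.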
Ken Benoit answered Feb 16 '23 03:02