Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

create a data frame of tag frequencies from a list of tag vectors

Tags:

r

I'm writing a function in R to find formality statistics (a linguistic measure) on typed dialogue. I use openNLP's parts of speech tagger to tag words (amazing tool but slow because it's doing some heavy duty stuff). Anyway time is an issue already with this function and I'm running into an issue that I want to make sur eruns a quickly as possible. I started thinking in convoluted terms and knew I needed some collective group think on this.

I have a list of vectors with tags in them like this:

G 
[[1]]
[1] "MD"  "DT"  "NN"  "VB"  "VBG" "TO"  "POS"

[[2]]
[1] "DT" "NN" "JJ" "RB"

[[3]]
[1] "RB"  "TO"  "PRP"

[[4]]
[1] "VBZ" "PRP" "VBG" "RB"  "TO"  "NN" 

[[5]]
[1] "NN" "NN"

For each vector I want to count the frequency of occurrences of all possible tags (a zero will be inserted of a vector does not contain a tag) and generate a data frame structure like this below:

  DT  JJ  MD  NN  POS PRP RB  TO  VB  VBG VBZ
1  1   0   1   1    1   0  0   1   1    1   0
2  1   1   0   1    0   0  1   0   0    0   0
3  0   0   0   0    0   1  1   1   0    0   0
4  0   0   0   1    0   1  1   1   1    1   1
5  0   0   0   2    0   0  0   0   0    0   0

I've put my begining thinking around it below as well as the fake data set. I initially thought to go with table on this but I'm not sure 9as I know this is slower than say the use of rle or match or indexing [ if any of these can be used. I also thought about using Reduce with merge on these vectors to do a multi merge, but know that the higher order functions in R may be slower than other methods (perhaps this can be done with some sweet indexing).

Any way I'd greatly appreciate help on this problem. The two paremeters I'm looking for are:

  1. A base solution
  2. Speed

The data and my initial thinking (table may be the wrong way to go:

G <- list(c("MD", "DT", "NN", "VB", "VBG", "TO", "POS"), c("DT", "NN", 
"JJ", "RB"), c("RB", "TO", "PRP"), c("VBZ", "PRP", "VBG", "RB", 
"TO", "NN"), c("NN", "NN"))

P <- lapply(G, function(x) table(sort(x)))  #to get frequencies on each word
sort(unique(names(unlist(P))))  #to get the column names and number

Apologies for the thread name as this one is a hard one to classify.

EDIT: (added bench marking results)

Very creative answers. I didn't even think about the factor solution and specifying the levels. Smart. For speed Joran's second answer winds (I just added the column names back using you're already created lev. mdsummer's response was the least amount of code and was tied-ish for second with speed. I'll go with Joran's second response as it will get me the best speed boost. Thank you all! Much appreciation :) Comparison available as a gist https://gist.github.com/trinker/91802b8c4ba759034881

       expr        min         lq      mean     median        uq       max neval
   JORAN1()  648.04435  689.16756  714.9142  712.59122  732.4991  831.6623   100
   JORAN2()   86.83879   92.91911   98.7068   97.44690  101.6764  177.4228   100
   RINKER()   87.40797   94.07564  100.1154   98.39624  104.0887  177.3146   100
      TIM()  900.65847  964.23419  993.9475  988.89306 1023.0587 1137.6263   100
 MDSUMMER() 1395.95920 1487.45279 1527.3181 1527.92664 1571.0997 1685.3298   100
like image 868
Tyler Rinker Avatar asked Dec 22 '22 00:12

Tyler Rinker


2 Answers

I'd do either this:

lev <- sort(unique(unlist(G)))

G1 <- do.call(rbind,lapply(G,function(x,lev){ table(factor(x,levels = lev,
                                                     ordered = TRUE))},lev = lev))

     DT JJ MD NN POS PRP RB TO VB VBG VBZ
[1,]  1  0  1  1   1   0  0  1  1   1   0
[2,]  1  1  0  1   0   0  1  0  0   0   0
[3,]  0  0  0  0   0   1  1  1  0   0   0
[4,]  0  0  0  1   0   1  1  1  0   1   1
[5,]  0  0  0  2   0   0  0  0  0   0   0

or for more speed (but losing the column names):

G1 <- do.call(rbind,lapply(G,function(x,lev){ tabulate(factor(x,levels = lev,
                                ordered = TRUE),nbins = length(lev))},lev = lev))
like image 95
joran Avatar answered Dec 24 '22 02:12

joran


This does what you want I think, just get the full list of unique values as factor levels and then tabulate based on each vector being an instance of that factor.

Then you can wrap the whole thing up in a do.call and bind the rows together:

levs <- sort(unique(names(unlist(P))))

do.call("rbind", lapply(G, function(x) table(factor(x, levs))))
like image 36
mdsumner Avatar answered Dec 24 '22 01:12

mdsumner