I'm writing a function in R to compute formality statistics (a linguistic measure) on typed dialogue. I use openNLP's part-of-speech tagger to tag words (an amazing tool, but slow because it's doing some heavy-duty work). Time is already an issue with this function, and I've hit a step that I want to make sure runs as quickly as possible. I started thinking in convoluted terms and knew I needed some collective group-think on this.
I have a list of vectors with tags in them like this:
G
[[1]]
[1] "MD" "DT" "NN" "VB" "VBG" "TO" "POS"
[[2]]
[1] "DT" "NN" "JJ" "RB"
[[3]]
[1] "RB" "TO" "PRP"
[[4]]
[1] "VBZ" "PRP" "VBG" "RB" "TO" "NN"
[[5]]
[1] "NN" "NN"
For each vector I want to count the frequency of occurrences of all possible tags (a zero will be inserted if a vector does not contain a tag) and generate a data frame structured like the one below:
DT JJ MD NN POS PRP RB TO VB VBG VBZ
1 1 0 1 1 1 0 0 1 1 1 0
2 1 1 0 1 0 0 1 0 0 0 0
3 0 0 0 0 0 1 1 1 0 0 0
4 0 0 0 1 0 1 1 1 0 1 1
5 0 0 0 2 0 0 0 0 0 0 0
I've put my beginning thinking on it below, as well as the fake data set. I initially thought to go with table for this, but I'm not sure, as I know it is slower than, say, rle, match, or indexing with [, if any of these can be used. I also thought about using Reduce with merge on these vectors to do a multi-way merge, but I know the higher-order functions in R may be slower than other approaches (perhaps this can be done with some sweet indexing; a rough sketch of that idea follows below).
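For what it's worth, here is a minimal sketch of that indexing idea, using the G shown above (the names lev, i, and out are mine, not part of the original thinking): flatten G once, pair each tag with the index of the vector it came from, and cross-tabulate in a single pass.

lev <- sort(unique(unlist(G)))             # every tag that occurs anywhere
i <- rep(seq_along(G), lengths(G))         # which vector each tag came from
out <- as.data.frame.matrix(table(i, factor(unlist(G), levels = lev)))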
Anyway, I'd greatly appreciate help on this problem. The two things I'm after are speed and the data frame structure shown above.
The data and my initial thinking (table may be the wrong way to go):
G <- list(c("MD", "DT", "NN", "VB", "VBG", "TO", "POS"), c("DT", "NN",
"JJ", "RB"), c("RB", "TO", "PRP"), c("VBZ", "PRP", "VBG", "RB",
"TO", "NN"), c("NN", "NN"))
P <- lapply(G, function(x) table(sort(x))) # frequencies of the tags in each vector
sort(unique(names(unlist(P)))) # the column names (and how many there are)
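To show where that thinking could lead (a sketch only; tags, counts, and out are illustrative names I'm introducing): pad each per-vector table out to the full tag set and bind the rows.

tags <- sort(unique(names(unlist(P))))
out <- t(sapply(P, function(tb) {
    counts <- setNames(rep(0L, length(tags)), tags)  # every tag starts at zero
    counts[names(tb)] <- tb                          # fill in the observed counts
    counts
}))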
Apologies for the thread title; this one is hard to classify.
EDIT: (added benchmarking results)
Very creative answers. I didn't even think about the factor solution and specifying the levels. Smart. For speed, Joran's second answer wins (I just added the column names back using your already-created lev). mdsummer's response was the least amount of code and was more or less tied for second on speed. I'll go with Joran's second response, as it will get me the best speed boost. Thank you all! Much appreciation :) The comparison is available as a gist: https://gist.github.com/trinker/91802b8c4ba759034881
expr min lq mean median uq max neval
JORAN1() 648.04435 689.16756 714.9142 712.59122 732.4991 831.6623 100
JORAN2() 86.83879 92.91911 98.7068 97.44690 101.6764 177.4228 100
RINKER() 87.40797 94.07564 100.1154 98.39624 104.0887 177.3146 100
TIM() 900.65847 964.23419 993.9475 988.89306 1023.0587 1137.6263 100
MDSUMMER() 1395.95920 1487.45279 1527.3181 1527.92664 1571.0997 1685.3298 100
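For reference, a comparison like the one above can be produced with the microbenchmark package (a sketch only; it assumes each answer has been wrapped in a function named as in the table):

library(microbenchmark)
microbenchmark(JORAN1(), JORAN2(), RINKER(), TIM(), MDSUMMER(), times = 100)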
I'd do either this:
lev <- sort(unique(unlist(G)))
G1 <- do.call(rbind, lapply(G, function(x, lev) {
    table(factor(x, levels = lev, ordered = TRUE))
}, lev = lev))
DT JJ MD NN POS PRP RB TO VB VBG VBZ
[1,] 1 0 1 1 1 0 0 1 1 1 0
[2,] 1 1 0 1 0 0 1 0 0 0 0
[3,] 0 0 0 0 0 1 1 1 0 0 0
[4,] 0 0 0 1 0 1 1 1 0 1 1
[5,] 0 0 0 2 0 0 0 0 0 0 0
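(Since the question asks for a data frame, the matrix can be converted afterwards with as.data.frame(G1) if that matters.)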
or for more speed (but losing the column names):
G1 <- do.call(rbind, lapply(G, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
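As the edit above notes, the column names can be restored afterwards from the already-computed lev:

colnames(G1) <- lev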
This does what you want, I think: get the full list of unique values to use as factor levels, then tabulate each vector as an instance of that factor. You can then wrap the whole thing in do.call to bind the rows together:
levs <- sort(unique(names(unlist(P))))
do.call("rbind", lapply(G, function(x) table(factor(x, levs))))