I have a big table in my database containing a lot of words from various texts, stored in text order. I want to find the number of times (the frequency) that some set of words appears together.
Example: Suppose I have these 4 words in many texts: United | States | of | America. I would get as a result:
United States: 50
United States of: 45
United States of America: 40
(This is only an example with 4 words; there can be sets with fewer or more than 4 words.)
Is there some algorithm that can do this, or something similar?
Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.
Table Structure
I have two tables: Token, which has id and text. The text is UNIQUE, and each entry in this table represents a different word.
TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text. It has textblockid, which is the block of text the token belongs to; sentence, which is the sentence the token is in; position, which is the token's position inside the sentence; and tokenid, which is the reference to the Token table.
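(For illustration only: assuming the join of TextBlockHasToken and Token has already been pulled into R ordered by textblockid, sentence and position, adjacent-pair counts could be obtained roughly like this. The data frame and its columns are just a mock-up of the schema described above.)
# Hypothetical mock-up of the schema above, with the tokens in text order.
tokens <- data.frame(
  textblockid = 1,
  sentence    = 1,
  position    = 1:4,
  text        = c("united", "states", "of", "america")
)

words   <- tokens$text
bigrams <- paste(head(words, -1), tail(words, -1))  # adjacent word pairs
table(bigrams)                                       # bigram -> frequency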
It is called an N-gram; in your case, a 4-gram. It can indeed be obtained as the by-product of a Markov chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensional "histogram".
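A minimal sketch of that sliding-window counter in R (not tied to the asker's schema; words is assumed to be a character vector of tokens in text order):
# Sketch: slide a window of size n over the token vector and count each n-gram.
count.ngrams <- function(words, n = 4) {
  m <- length(words) - n + 1
  if (m < 1) return(table(character(0)))
  grams <- vapply(seq_len(m),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)   # n-gram -> frequency
}

# e.g. count.ngrams(c("united", "states", "of", "america", "united", "states"), n = 2)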
UPDATE 2011-11-22: A Markov chain is a way to model the probability of switching to a new state, given the current state. This is the stochastic equivalent of a "state machine". In the natural language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (before the previous N words) as equal to one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words to follow are the probabilities of the current node's offspring. But every time we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which translates to N levels deep into the tree.
You can easily see that if you are walking a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word it is P(w2 | w1), etc. So, when processing the corpus you build a Markov tree ( := update the frequencies in the nodes); at the end of the ride you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5 levels deep into the tree this is the probability of the word, given the previous 4 words. If you want the N-gram probabilities, you want the product of all the probabilities in the path from the root to the last word.
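To make the estimate freq(word) / SUM(freq(siblings)) concrete without building the tree: with plain n-gram and (n-1)-gram counts (for instance from the count.ngrams sketch above, which is an assumption, not the asker's code), the conditional probability of the last word given the previous ones could be approximated like this:
# Sketch: P(w_n | w_1 ... w_{n-1}) ~ count(w_1 ... w_n) / count(w_1 ... w_{n-1})
ngram.prob <- function(words, ngram) {
  n      <- length(ngram)
  full   <- count.ngrams(words, n)       # n-gram counts (sketch above)
  prefix <- count.ngrams(words, n - 1)   # (n-1)-gram prefix counts
  key      <- paste(ngram, collapse = " ")
  key.pref <- paste(head(ngram, -1), collapse = " ")
  as.numeric(full[key]) / as.numeric(prefix[key.pref])
}

# e.g. ngram.prob(words, c("united", "states", "of", "america"))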
This is a typical use case for Markov chains. Estimate the Markov model from your textbase and find high probabilities in the transition table. Since these indicate probabilities that one word will follow another, phrases will show up as high transition probabilities.
By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
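A minimal sketch of that idea in R (again assuming words is a character vector of tokens in text order; this is an illustration, not the asker's schema):
# Sketch: first-order transition counts and row-normalized probabilities.
transition.probs <- function(words) {
  counts <- table(from = head(words, -1), to = tail(words, -1))  # bigram counts
  prop.table(counts, margin = 1)                                 # P(to | from)
}

# High entries point at likely phrases, e.g.:
# probs <- transition.probs(words)
# which(probs > 0.8, arr.ind = TRUE)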
Here is a small snippet that calculates all combinations/ngrams of a text for a given set of words. In order to work for larger datasets it uses the hash library, though it is probably still pretty slow...
require(hash)

get.ngrams <- function(text, target.words) {
  text <- tolower(text)
  split.text <- strsplit(text, "\\W+")[[1]]   # split on non-word characters
  ngrams <- hash()                            # n-gram -> count
  current.ngram <- ""
  for (i in seq_along(split.text)) {
    word <- split.text[i]
    word_i <- i
    # grow the n-gram as long as consecutive words belong to the target set
    while (word %in% target.words) {
      if (current.ngram == "") {
        current.ngram <- word
      } else {
        current.ngram <- paste(current.ngram, word)
      }
      # increment the count for the n-gram seen so far
      if (has.key(current.ngram, ngrams)) {
        ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
      } else {
        ngrams[[current.ngram]] <- 1
      }
      word_i <- word_i + 1
      word <- split.text[word_i]              # NA past the end stops the loop
    }
    current.ngram <- ""
  }
  ngrams
}
So the following input ...
some.text <- "He states that he loves the United States of America,
and I agree it is nice in the United States."
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
... would result in the following hash:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of america : 1
states : 3
states of : 1
states of america : 1
united : 2
united states : 2
united states of : 1
united states of america : 1
Notice that this function is case-insensitive and registers any permutation of the target words, e.g.:
some.text <- "States of united America are states"
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
...results in:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of united : 1
of united america : 1
states : 2
states of : 1
states of united : 1
states of united america : 1
united : 1
united america : 1
I'm not sure if it's of help to you, but here is a little Python program I wrote about a year ago that counts N-grams (well, only mono-, bi-, and trigrams). (It also calculates the entropy of each N-gram.) I used it to count those N-grams in a large text. Link
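The linked program itself is not reproduced here, but for reference, the entropy of an n-gram frequency distribution boils down to the standard Shannon formula; an R sketch (whether the Python script computes exactly this is an assumption):
# Sketch: Shannon entropy (in bits) of an n-gram count table (all counts > 0).
ngram.entropy <- function(counts) {
  p <- counts / sum(counts)   # relative frequencies
  -sum(p * log2(p))
}

# e.g. ngram.entropy(count.ngrams(words, n = 2)) using the earlier counting sketch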