What algorithm is used for finding ngrams?
Supposing my input data is an array of words plus the size of the n-grams I want to find, what algorithm should I use?
I'm asking for code, with a preference for R. The data is stored in a database, so a PL/pgSQL function would work too. Java is the language I know best, so I can "translate" it to another language.
I'm not lazy; I'm only asking for code because I don't want to reinvent the wheel by writing an algorithm that already exists.
Edit: it's important to know how many times each n-gram appears.
Edit 2: is there an R package for n-grams?
An n-gram model is a type of probabilistic language model that predicts the next item in a sequence, in the form of an (n − 1)-order Markov model.
Based on the results, the model performs at its best with the n-gram range of (1,5). This means that training the model with n-grams ranging from unigrams to 5-grams helps achieve optimal results, while larger n-grams only produce sparser input features, which hampers model performance.
It's a probabilistic model that's trained on a corpus of text. Such a model is useful in many NLP applications, including speech recognition, machine translation and predictive text input. An n-gram model is built by counting how often word sequences occur in a text corpus and then estimating probabilities from those counts.
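To make the counting step concrete for the case in the question (a vector of words plus a size n), here is a minimal base R sketch. It is only one way to do it: the function name `count_ngrams`, the sample sentence and the variable names are made up for illustration, and it assumes the words are already tokenized and fit in memory. It slides a window of width n over the words, joins each window into a string, and tallies the strings with `table()`.

```r
# Hypothetical helper (not from any package): count every n-gram in a
# character vector of already-tokenized words, using base R only.
count_ngrams <- function(words, n) {
  stopifnot(is.character(words), n >= 1)
  n_grams <- length(words) - n + 1
  if (n_grams < 1) return(integer(0))  # fewer than n words: no n-grams
  # Slide a window of width n over the words and join each window with spaces.
  grams <- vapply(
    seq_len(n_grams),
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # table() tallies identical n-grams; convert to a plain named integer vector
  # and sort so the most frequent n-grams come first.
  counts <- table(grams)
  sort(setNames(as.vector(counts), names(counts)), decreasing = TRUE)
}

# Made-up sample input.
words <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "slept")
bigrams <- count_ngrams(words, 2)
print(bigrams)

# Maximum-likelihood estimate of P(next = "cat" | previous = "the"):
# count("the cat") divided by the number of bigrams that start with "the".
starts_with_the <- bigrams[startsWith(names(bigrams), "the ")]
p_cat_given_the <- bigrams[["the cat"]] / sum(starts_with_the)
print(p_cat_given_the)
```

The same sliding-window idea carries over directly to Java or PL/pgSQL: loop over start positions, join n consecutive words, and increment a counter keyed by the joined string.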
If you want to use R to identify ngrams, you can use the tm package and the RWeka package. It will tell you how many times each ngram occurs in your documents, like so:
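library("RWeka")   # provides NGramTokenizer and Weka_control (requires Java)
library("tm")      # provides TermDocumentMatrix and the crude corpus
data("crude")      # sample corpus of 20 Reuters articles about crude oil
# Tokenizer that emits only bigrams (min = max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Rows are bigrams, columns are documents, cells are counts
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
# Show a 6-term by 10-document slice of the matrix
inspect(tdm[340:345, 1:10])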
A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1
hat-tip: http://tm.r-forge.r-project.org/faq.html
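The matrix above gives counts per document, while the question asks how many times each n-gram appears overall. One possible way to get totals is to sum each row. The sketch below is an assumption on my part rather than something from the tm FAQ: `total_ngram_counts` is a hypothetical helper, and the toy matrix with made-up terms and counts only stands in for a real term-document matrix so the example runs without tm or RWeka installed.

```r
# Hypothetical helper: total count of each term (n-gram) across all documents.
# as.matrix() builds a dense matrix, which is fine for small corpora only.
total_ngram_counts <- function(tdm) {
  counts <- rowSums(as.matrix(tdm))
  sort(counts, decreasing = TRUE)
}

# Made-up stand-in for a term-document matrix: rows are bigrams,
# columns are documents, cells are counts.
toy_tdm <- matrix(
  c(2, 0, 1,
    0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(Terms = c("crude oil", "oil prices"), Docs = c("d1", "d2", "d3"))
)
totals <- total_ngram_counts(toy_tdm)
print(totals)

# With the real objects from the answer this would be called as:
# total_ngram_counts(tdm)
```

For a large corpus the dense conversion may not fit in memory; the slam package that tm builds on has a `row_sums()` function that works on the sparse representation directly, which is probably the better choice there.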
For anyone still interested in this topic, there is already a package on CRAN.
ngram: An n-gram Babbler
This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.
http://cran.r-project.org/web/packages/ngram/index.html
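A rough sketch of how the package might be used to get counts, assuming it is installed from CRAN. The sample words are made up, and the guard simply skips the example when the package is missing. As far as I know, `ngram()` expects a single string rather than a vector of words, so the words are pasted together first, and `get.phrasetable()` returns a data frame with one row per distinct n-gram and its frequency.

```r
# Assumes the CRAN package 'ngram' is installed; otherwise the example is skipped.
phrase_table <- NULL
if (requireNamespace("ngram", quietly = TRUE)) {
  # Made-up sample input: a vector of words, joined into one string.
  words <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "slept")
  text <- paste(words, collapse = " ")
  ng <- ngram::ngram(text, n = 2)             # build all bigrams
  phrase_table <- ngram::get.phrasetable(ng)  # distinct bigrams with their counts
  print(phrase_table)
}
```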