
What algorithm do I need to find n-grams?

Tags:

r

n-gram

What algorithm is used for finding n-grams?

Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?

I'm asking for code, with a preference for R. The data is stored in a database, so it could also be a PL/pgSQL function. Java is the language I know best, so I can "translate" it into another language if needed.

I'm not lazy; I'm only asking for code because I don't want to reinvent the wheel by writing an algorithm that is already done.

Edit: it's important to know how many times each n-gram appears.
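To make the requirement concrete, here is a rough sketch in Java (class and method names are just illustrative) of the naive sliding-window approach: walk the word array, join each window of n words into a key, and tally it in a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGrams {
    // Slide a window of size n over the word array and count each n-gram.
    static Map<String, Integer> countNGrams(String[] words, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i <= words.length - n; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            counts.merge(sb.toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "sat", "on", "the", "cat"};
        // Bigrams: {the cat=2, cat sat=1, sat on=1, on the=1}
        System.out.println(countNGrams(words, 2));
    }
}
```

This is O(n · length) and trivially translates to R, PL/pgSQL, or anything else with an associative array.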

Edit 2: is there an R package for n-grams?

Renato Dinhani asked Nov 17 '11

People also ask

What is n-gram algorithm?

An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model.

What is n-gram range in NLP?

Based on the results, the model performs at its best with the n-gram range of (1,5). This means that training the model with n-grams ranging from unigrams to 5-grams help achieve optimal results, but larger n-grams only result in more sparse input features, which hampers model performance.

What is n-gram model in AI?

It's a probabilistic model that's trained on a corpus of text. Such a model is useful in many NLP applications including speech recognition, machine translation and predictive text input. An N-gram model is built by counting how often word sequences occur in corpus text and then estimating the probabilities.
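As a toy illustration of that counting step (the corpus and names here are made up for the example), a maximum-likelihood bigram estimate divides the count of a word pair by the count of its first word:

```java
import java.util.HashMap;
import java.util.Map;

public class BigramModel {
    // Maximum-likelihood estimate: P(next | prev) = count(prev next) / count(prev)
    static double bigramProb(String[] corpus, String prev, String next) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < corpus.length; i++) {
            unigrams.merge(corpus[i], 1, Integer::sum);
            if (i + 1 < corpus.length) {
                bigrams.merge(corpus[i] + " " + corpus[i + 1], 1, Integer::sum);
            }
        }
        return (double) bigrams.getOrDefault(prev + " " + next, 0) / unigrams.get(prev);
    }

    public static void main(String[] args) {
        String[] corpus = {"I", "like", "tea", "I", "like", "coffee"};
        // "like" occurs twice and is followed once by "tea", so P = 0.5
        System.out.println(bigramProb(corpus, "like", "tea"));
    }
}
```

Real models add smoothing so unseen n-grams don't get probability zero, but the counting core is just this.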


2 Answers

If you want to use R to identify n-grams, you can use the tm package and the RWeka package. It will tell you how many times each n-gram occurs in your documents, like so:

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])

A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13 
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1

hat-tip: http://tm.r-forge.r-project.org/faq.html

Ben answered Oct 04 '22


For anyone still interested in this topic, there is already a package on CRAN.

ngram: An n-gram Babbler

This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.

http://cran.r-project.org/web/packages/ngram/index.html

IceBruce answered Oct 04 '22