
When are n-grams (n>3) important as opposed to just bigrams or trigrams?

I am just wondering what n-grams (n > 3), and their occurrence frequencies, are actually useful for, considering the computational overhead of computing them. Are there any applications where bigrams or trigrams are simply not enough?

If so, what is the state-of-the-art in n-gram extraction? Any suggestions? I am aware of the following:

asked Apr 23 '12 by Legend


People also ask

What are bigrams Trigrams and n-grams?

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on.

What is the purpose of n-grams?

N-grams are continuous sequences of words or symbols or tokens in a document. In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
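As a concrete illustration of that definition, here is a minimal Python sketch (the function name and the sample sentence are made up for the example) that slides a window of size n over a list of tokens:

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens from the list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(word_ngrams(tokens, 2))  # bigrams:  ('the', 'quick'), ('quick', 'brown'), ...
print(word_ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```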

What is the purpose of n-grams in data analytics?

N-gram models are useful in many text analytics applications where sequences of words are relevant, such as in sentiment analysis, text classification, and text generation. N-gram modeling is one of the many techniques used to convert text from an unstructured format to a structured format.
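One common way of doing that unstructured-to-structured conversion is a bag-of-n-grams feature matrix. The sketch below uses scikit-learn's CountVectorizer purely as an illustration (scikit-learn isn't mentioned in the question, and the two sample documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Count unigrams and bigrams together; ngram_range=(1, 3) would add trigrams too.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # e.g. 'cat', 'cat sat', 'dog', 'dog sat', ...
print(X.toarray())                     # one row of n-gram counts per document
```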

How and when is n-gram tokenization used?

The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length. N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length.
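That description matches, for instance, the ngram tokenizer in Elasticsearch. Below is only a plain-Python sketch of the same sliding-window behaviour (the split pattern and the 2–3 gram lengths are arbitrary choices for the example), not that tokenizer's actual implementation:

```python
import re

def char_ngrams(text, min_gram=2, max_gram=3, split_pattern=r"[^a-zA-Z0-9]+"):
    """Break text into words on the split pattern, then emit every contiguous
    character sequence whose length is between min_gram and max_gram."""
    grams = []
    for word in re.split(split_pattern, text):
        for n in range(min_gram, max_gram + 1):
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("Quick Fox"))
# ['Qu', 'ui', 'ic', 'ck', 'Qui', 'uic', 'ick', 'Fo', 'ox', 'Fox']
```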


1 Answer

I'm not familiar with a good deal of the tags listed here, but n-grams (the abstract concept) are often useful in statistical models. With that in mind, here are some applications that aren't restricted to just bigrams and trigrams:

  • Compression algorithms (the PPM variety especially) where the length of the grams depends on how much data is available for providing specific contexts.
  • Approximate string matching (e.g. BLAST for genetic sequence matching)
  • Predictive models (e.g. name generators; a small sketch follows below)
  • Speech recognition (phoneme n-grams are used to help evaluate the likelihood of candidates for the phoneme currently being recognized)

Those are the ones off the top of my head, but there's much more listed on Wikipedia.
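To make the "predictive models" bullet concrete, here's a minimal character-level n-gram name generator (the training names and the length-3 context are arbitrary choices for the sketch, not anything from the answer above):

```python
import random
from collections import defaultdict

def train(names, n=3):
    """Map each length-n character context to the characters seen after it."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * n + name.lower() + "$"        # start/end markers
        for i in range(len(padded) - n):
            model[padded[i:i + n]].append(padded[i + n])
    return model

def generate(model, n=3, max_len=12):
    """Sample one character at a time, conditioned on the previous n characters."""
    out, context = "", "^" * n
    while len(out) < max_len:
        nxt = random.choice(model[context])
        if nxt == "$":                               # end-of-name marker
            break
        out += nxt
        context = context[1:] + nxt
    return out.capitalize()

names = ["amelia", "amanda", "miranda", "marina", "melinda", "belinda"]
model = train(names)
print([generate(model) for _ in range(5)])
```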

As far as "state-of-the-art" n-gram extraction goes, I have no idea. N-gram "extraction" is an ad hoc attempt to speed up certain processes while still keeping the benefits of n-gram-style modeling. In short, "state-of-the-art" depends on what you're trying to do. If you're looking at fuzzy matching or fuzzy grouping, it depends on what kind of data you're matching or grouping. (E.g. street addresses are very different to fuzzy match than first names.)
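To make the fuzzy-matching point concrete, here's a small Python sketch (the sample strings and gram lengths are invented for the example) that scores similarity with a Jaccard index over character n-grams; raising n demands longer exact runs of characters, which is one situation where going past bigrams and trigrams changes the behaviour:

```python
def char_ngrams(s, n):
    """The set of all contiguous length-n character slices of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Overlap of the two strings' n-gram sets (1.0 means identical sets)."""
    ga, gb = char_ngrams(a.lower(), n), char_ngrams(b.lower(), n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Larger n penalizes small edits more heavily, so near-misses score lower.
for n in (2, 3, 4):
    print(n, round(jaccard("123 Main Street", "123 Main St", n), 2))
```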

answered Oct 20 '22 by Kaganar