When are n-grams (n>3) important as opposed to just bigrams or trigrams?

Tags: nlp, nltk, data-mining, n-gram

I am wondering what the use of n-grams with n>3 (and their occurrence frequencies) is, given the computational overhead of computing them. Are there any applications where bigrams or trigrams are simply not enough?

If so, what is the state-of-the-art in n-gram extraction? Any suggestions? I am aware of the following:

A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese (http://acl.ldc.upenn.edu/C/C94/C94-1101.pdf)
Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus (http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf)
Word association norms, mutual information, and lexicography
Retrieving collocations from text: Xtract

asked Apr 23 '12 18:04

Legend

1 Answer

I'm not familiar with many of the tags listed here, but n-grams (the abstract concept) are often useful in statistical models. Here are some applications that aren't restricted to bigrams and trigrams:

Compression algorithms (the PPM variety especially) where the length of the grams depends on how much data is available for providing specific contexts.
Approximate string matching (e.g. BLAST for genetic sequence matching)
Predictive models (e.g. name generators)
Speech recognition (phoneme n-grams are used to help evaluate the likelihood of each candidate for the phoneme currently being recognized)
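
To make the compression and predictive-model items a bit more concrete, here is a minimal sketch of a character-level model that uses the longest context it has seen and backs off to shorter ones when the longer context has no data. It only illustrates the "context length depends on available data" idea; it is not PPM itself (no escape probabilities, no arithmetic coding), and the names `build_counts`, `predict_next`, and `max_order` are made up for this example.

```python
from collections import Counter, defaultdict

def build_counts(text, max_order):
    """Count which character follows every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, ch in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break  # not enough preceding characters for this context length
            counts[text[i - k:i]][ch] += 1
    return counts

def predict_next(counts, history, max_order):
    """Return (context_used, most_likely_next_char), backing off to shorter contexts."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]  # last k characters ("" when k == 0)
        if context in counts:
            return context, counts[context].most_common(1)[0][0]
    return "", None  # only reached when the model was built from empty text

counts = build_counts("abracadabra", max_order=4)
print(predict_next(counts, "abra", 4))  # a 4-character context is available
print(predict_next(counts, "xbr", 4))   # unseen 3-character context, so it backs off
```

With more training text, longer contexts are seen more often and get used more often, which is where n>3 starts to pay off.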

Those are the ones off the top of my head, but there are many more listed on Wikipedia.

As for "state-of-the-art" n-gram extraction, I have no idea. N-gram "extraction" is an ad hoc attempt to speed up certain processes while still keeping the benefits of n-gram-style modeling. In short, "state-of-the-art" depends on what you're trying to do. If you're looking at fuzzy matching or fuzzy grouping, it depends on what kind of data you're matching or grouping. (E.g. fuzzy matching street addresses is very different from fuzzy matching first names.)
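
For the fuzzy matching case, one common approach is to compare the sets of character n-grams of two strings. The following is a rough sketch using Jaccard similarity; the helper names `char_ngrams` and `ngram_similarity`, the space padding, and the sample names are assumptions for illustration, and the right `n` and normalization really do depend on the data.

```python
def char_ngrams(s, n):
    """Return the set of character n-grams of s, padded so string edges count too."""
    padded = " " * (n - 1) + s.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of a and b."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty n-gram sets are treated as identical
    return len(ga & gb) / len(ga | gb)

# Rank some hypothetical first names against a misspelled query.
names = ["Jon", "John", "Joan", "Mary"]
query = "Jhon"
ranked = sorted(names, key=lambda name: ngram_similarity(query, name, n=2), reverse=True)
for name in ranked:
    print(name, round(ngram_similarity(query, name, n=2), 2))
```

Short strings such as first names tend to work better with small n, while longer and more structured strings such as street addresses can tolerate larger n or word-level n-grams.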

answered Oct 20 '22 22:10

Kaganar
