
N-gram generation from a sentence

How can I generate word n-grams from a string like this:

String Input="This is my car." 

I want to generate all n-grams up to a given size from this input:

Input Ngram size = 3 

Output should be:

    This
    is
    my
    car

    This is
    is my
    my car

    This is my
    is my car

Please suggest how to implement this in Java, or point me to a library that already does it.

I am trying to use NGramTokenizer, but it gives n-grams of character sequences, and I want n-grams of word sequences.

Preetam Purbia asked Sep 07 '10 07:09




2 Answers

I believe this would do what you want:

    import java.util.*;

    public class Test {

        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append((i > start ? " " : "") + words[i]);
            return sb.toString();
        }

        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : ngrams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }

Output:

    This
    is
    my
    car.

    This is
    is my
    my car.

    This is my
    is my car.
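Note that the output above keeps the trailing period on "car.", while the output requested in the question has plain "car". A minimal sketch of one way to close that gap follows, assuming it is acceptable to strip punctuation from the edges of each token and to split on any run of whitespace. The class name `WordNgrams` and the punctuation rule are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

public class WordNgrams {

    // Returns all word n-grams of size n, ignoring punctuation at token edges.
    public static List<String> ngrams(int n, String text) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> words = new ArrayList<>();
        // Split on any run of whitespace, not just a single space.
        for (String token : text.trim().split("\\s+")) {
            // Strip leading and trailing punctuation, e.g. "car." -> "car".
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        List<String> result = new ArrayList<>();
        // Slide a window of n words across the cleaned word list.
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car.")) {
                System.out.println(ngram);
            }
            System.out.println();
        }
    }
}
```

Punctuation inside a word (for example the apostrophe in "don't") is left alone, since only the token edges are trimmed. This sketch uses `String.join`, so it needs Java 8 or later.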

An "on-demand" solution implemented as an Iterator:

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    class NgramIterator implements Iterator<String> {

        String[] words;
        int pos = 0, n;

        public NgramIterator(int n, String str) {
            this.n = n;
            words = str.split(" ");
        }

        public boolean hasNext() {
            return pos < words.length - n + 1;
        }

        public String next() {
            // Honor the Iterator contract once all n-grams have been returned.
            if (!hasNext())
                throw new NoSuchElementException();
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++)
                sb.append((i > pos ? " " : "") + words[i]);
            pos++;
            return sb.toString();
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
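On Java 8 or later, the same on-demand behaviour can also be sketched with a `Stream`, which builds each n-gram only when the stream is consumed. This is an alternative technique, not the answer's own code; the class name `NgramStream` is a hypothetical name chosen for illustration, and the sketch keeps the answer's single-space split.

```java
import java.util.Arrays;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class NgramStream {

    // Lazily yields word n-grams; nothing is joined until the stream is consumed.
    public static Stream<String> ngrams(int n, String str) {
        String[] words = str.split(" ");
        // If n exceeds the word count the range is empty, so the stream is empty too.
        return IntStream.range(0, words.length - n + 1)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }

    public static void main(String[] args) {
        ngrams(2, "This is my car.").forEach(System.out::println);
    }
}
```

Because the result is a `Stream`, callers can chain operations such as `limit` or `filter` without materializing every n-gram first.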

aioobe answered Sep 18 '22 11:09


You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.

Shashikant Kore answered Sep 16 '22 11:09