
What is the difference between bigram and unigram text feature extraction?

I searched online for how to do bigram and unigram text feature extraction, but couldn't find anything useful. Can someone explain the difference between them?

For example, if I have the text "I have a lovely dog", what happens if I use the bigram way to extract features versus the unigram way?

asked Apr 18 '17 by user144600



2 Answers

We are trying to teach machines how to do natural language processing. Humans understand language easily, but machines cannot, so we teach them specific patterns of the language. A single word has a meaning on its own, but combining words into groups often makes the meaning easier to capture.

An n-gram is basically a sequence of adjacent words within a given window, so when

  • n=1 it is a unigram

  • n=2 it is a bigram

  • n=3 it is a trigram, and so on

Now suppose the machine tries to understand the meaning of the sentence "I have a lovely dog"; it will split the sentence into specific chunks.

  1. With unigrams it considers words one by one, so each single word is a gram:

    "I", "have", "a" , "lovely" , "dog"

  2. With bigrams it considers two words at a time, so each pair of adjacent words is a gram:

    "I have" , "have a" , "a lovely" , "lovely dog"

In this way the machine splits sentences into small groups of words to understand their meaning.
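The splitting described above can be sketched in a few lines of plain Python (the function name is illustrative, not from any particular library):

```python
def ngrams(text, n):
    """Split whitespace-tokenized text into n-grams of n adjacent words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I have a lovely dog"
print(ngrams(sentence, 1))  # ['I', 'have', 'a', 'lovely', 'dog']
print(ngrams(sentence, 2))  # ['I have', 'have a', 'a lovely', 'lovely dog']
```

The same function gives trigrams with `n=3`, which is why the general technique is simply called "n-grams".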

answered Oct 07 '22 by Sagar Damani


Example: Consider the sentence "I ate banana".

In a unigram model we assume that the occurrence of each word is independent of its previous word. Hence each word becomes a gram (feature) here.

For unigram, we will get 3 features - 'I', 'ate', 'banana' and all 3 are independent of each other. Although this is not the case in real languages.

In a bigram model we assume that the occurrence of each word depends only on its previous word. Hence two adjacent words are counted as one gram (feature) here.

For bigram, we will get 2 features - 'I ate' and 'ate banana'. This makes sense since the model will learn that 'banana' comes after 'ate' and not the other way around.

Similarly, we can have trigrams, and in general n-grams.
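To see why a bigram feature captures word order, here is a minimal sketch (pure Python, the helper name is made up for illustration) that counts bigram features over a tiny corpus:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigram features (pairs of adjacent words) across sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

corpus = ["I ate banana", "I ate apple"]
counts = bigram_counts(corpus)
print(counts[("I", "ate")])       # 2
print(counts[("ate", "banana")])  # 1
print(counts[("banana", "ate")])  # 0
```

The feature ('ate', 'banana') gets a nonzero count while ('banana', 'ate') stays at zero, which is exactly the directional information the answer describes and which unigram counts cannot express.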

answered Oct 07 '22 by Rishabh