Why does the ngrams() function give distinct bigrams?

Tags: r, nlp, n-gram

I am writing an R script and am using library(ngram).

Suppose I have a string,

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

and want to find bi-grams.

The ngram library is giving me bi-grams as follows:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"

The sentence contains "dog food" twice, so I want this bi-gram returned twice, but I only get it once.

Is there an option in the ngram library, or any other library in R, that gives all the bi-grams of my sentence, including repeats?

KrunalParmar asked Sep 29 '15 17:09


1 Answer

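The development version of ngram has a get.phrasetable function, which returns every distinct n-gram together with its frequency and proportion, so repeated bi-grams show up with freq greater than 1: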

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text, n = 2)  # n = 2 requests bi-grams
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154

In addition, you can use the print() method and specify output = "full", which lists each n-gram with its count and the words that follow it:

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...
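
If installing the development version is not an option, the repeated bi-grams can also be recovered with base R alone. The following is only a minimal sketch, not part of the ngram package: it assumes the text is already cleaned and that words are separated by whitespace, and the variable names (words, bigrams, counts) are made up for illustration.

```r
text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

# Split the string into words on runs of whitespace
words <- strsplit(text, "\\s+")[[1]]

# Pair each word with the word that follows it; duplicates are kept
bigrams <- paste(head(words, -1), tail(words, -1))

# Tabulate the bi-grams and put the most frequent first
counts <- sort(table(bigrams), decreasing = TRUE)

length(bigrams)  # 26 bi-grams in total, "dog food" included twice
head(counts)
```

Here bigrams holds every bi-gram in sentence order with repeats, which is what the question asks for, while counts gives the same frequency information as the phrase table.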
JasonAizkalns answered Sep 28 '22 03:09