
Scikit-learn ngram_range purpose in vectorizers

Tags:

scikit-learn

What is the use of ngram_range in vectorizers like CountVectorizer and TfidfVectorizer? I understand that ngram_range=(1, 1) means unigrams only. What do ngram_range=(1, 2) and ngram_range=(2, 2) mean?

Sarath R Nair asked Nov 30 '13 12:11


People also ask

What is ngram_range in TF-IDF?

ngram_range: The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

What is ngram_range?

ngram_range: An n-gram is just a string of n words in a row. E.g. the sentence 'I am Groot' contains the 2-grams 'I am' and 'am Groot'. The sentence is itself a 3-gram. Set the parameter ngram_range=(a,b) where a is the minimum and b is the maximum size of ngrams you want to include in your features.
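The 'I am Groot' example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation: the helper name `word_ngrams` is hypothetical, and it splits on whitespace without the lowercasing or token filtering that the real vectorizers apply.

```python
def word_ngrams(text, a, b):
    """Return all word n-grams of `text` with a <= n <= b (hypothetical helper)."""
    tokens = text.split()  # naive whitespace tokenization, an assumption
    grams = []
    for n in range(a, b + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("I am Groot", 2, 2))  # ['I am', 'am Groot']
print(word_ngrams("I am Groot", 3, 3))  # ['I am Groot']
```

Widening the range to (1, 3) returns the three single words, the two 2-grams, and the full sentence together.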

What does CountVectorizer analyzer do?

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text.

What is count vectorization?

Count vectorization means breaking down a sentence or any text into words, with preprocessing steps such as converting all words to lowercase and removing special characters, and then counting how often each word occurs. NLP models can't work with raw text; they only accept numbers, so textual data needs to be vectorized.


1 Answer

ngram_range=(1, 2) means unigrams and bigrams; ngram_range=(2, 2) means only bigrams. Don't you think the docstring is precise enough?

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

How would you change the docstring to be more helpful?
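For anyone who wants to see the difference concretely, here is a minimal sketch that fits CountVectorizer on one made-up sentence with each of the three settings from the question and lists the resulting vocabulary. The sample sentence and the helper name `vocabulary_for` are illustrative assumptions, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up one-sentence corpus, purely for illustration.
docs = ["the quick brown fox"]


def vocabulary_for(ngram_range):
    """Fit a CountVectorizer and return its learned n-grams, sorted (hypothetical helper)."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    # vocabulary_ maps each extracted n-gram to its column index.
    return sorted(vectorizer.vocabulary_)


print(vocabulary_for((1, 1)))  # unigrams only
# ['brown', 'fox', 'quick', 'the']
print(vocabulary_for((1, 2)))  # unigrams and bigrams
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown', 'the', 'the quick']
print(vocabulary_for((2, 2)))  # bigrams only
# ['brown fox', 'quick brown', 'the quick']
```

TfidfVectorizer accepts the same ngram_range parameter, so the extracted vocabulary should be the same; only the weighting of the counts differs.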

Andreas Mueller answered Nov 24 '22 10:11