
n-gram vectorization using TfidfVectorizer

I am using TfidfVectorizer with the following parameters:

smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)

I am vectorizing the following text: "red sun, pink candy. Green flower."

Here is the output of get_feature_names():

['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']

Since "candy" and "green" belong to separate sentences, why is the "candy green" n-gram created?

Is there a way to prevent the creation of n-grams spanning multiple sentences?

leon asked Mar 06 '26 02:03


1 Answer

It depends on how you pass the text to TfidfVectorizer.

If it is passed as a single document, TfidfVectorizer (with its default token_pattern) keeps only tokens of 2 or more alphanumeric characters and lowercases them. Punctuation is ignored entirely and always treated as a token separator, so the period between "candy" and "Green" leaves no trace. Your text becomes:

['red', 'sun', 'pink', 'candy', 'green', 'flower'] 

The n-grams are then generated from this flat token list, which is why the adjacent tokens "candy" and "green" form the bigram "candy green".
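To see this for yourself, you can inspect the intermediate steps with the vectorizer's build_tokenizer() and build_analyzer() methods. This is a minimal sketch using the parameters from the question; the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."

vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer='word', ngram_range=(1, 2))

# build_tokenizer() applies only the token pattern (no lowercasing yet).
tokens = vectorizer.build_tokenizer()(text)

# build_analyzer() runs the whole pipeline: lowercasing, tokenizing,
# and n-gram generation over the flat token list.
ngrams = vectorizer.build_analyzer()(text)

print(tokens)
print(ngrams)
```

The analyzer output contains "candy green" because, once the punctuation is gone, nothing marks where the first sentence ended.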

Since TfidfVectorizer is a bag-of-words technique that works on the tokens of a document, it keeps no information about sentence boundaries within a single document; the only word order it retains is the adjacency captured inside each n-gram. If you want the sentences to be treated separately, you should detect them yourself and pass them as separate documents.
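A rough sketch of the separate-documents approach follows. The regex split on ".", "!" and "?" is a naive stand-in for real sentence detection and is my assumption, not something TfidfVectorizer provides; a proper sentence tokenizer would handle abbreviations and similar cases better.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."

# Naive sentence splitting (assumption): break on runs of . ! ?
sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer='word', ngram_range=(1, 2))

# Each sentence is now its own document, so no n-gram can cross a boundary.
matrix = vectorizer.fit_transform(sentences)
features = sorted(vectorizer.vocabulary_)

print(sentences)
print(features)
```

Keep in mind that this changes what a "document" is: the matrix gets one row per sentence, and the IDF values are computed over sentences rather than over the original texts.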

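Alternatively, pass your own analyzer to TfidfVectorizer. A callable analyzer replaces the built-in tokenization and n-gram generation, so it has to produce the n-grams itself.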
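One possible shape for such an analyzer is sketched below. The function sentence_ngrams and both regexes are hypothetical helpers written for this example; the token regex copies scikit-learn's default token_pattern, and the sentence splitter is again a naive assumption.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")   # same pattern as the default token_pattern
SENTENCE_END_RE = re.compile(r"[.!?]+")   # naive sentence boundary (assumption)

def sentence_ngrams(doc, ngram_range=(1, 2)):
    """Hypothetical analyzer: build word n-grams within each sentence only."""
    min_n, max_n = ngram_range
    features = []
    for sentence in SENTENCE_END_RE.split(doc.lower()):
        tokens = TOKEN_RE.findall(sentence)
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                features.append(" ".join(tokens[i:i + n]))
    return features

# With a callable analyzer, the vectorizer's own ngram_range is not used.
vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer=sentence_ngrams)
vectorizer.fit(["red sun, pink candy. Green flower."])
features = sorted(vectorizer.vocabulary_)

print(features)
```

Unlike splitting into separate documents, this keeps one row per original text, so the IDF statistics stay at the document level. Note that "sun pink" is still produced, because a comma does not end a sentence here.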

For more information on how TfidfVectorizer actually works, see my other answer:

  • sklearn TfidfVectorizer : Generate Custom NGrams by not removing stopword in them
Vivek Kumar answered Mar 08 '26 20:03


