n-gram vectorization using TfidfVectorizer

Question

I am using TfidfVectorizer with the following parameters:

smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)

I am vectorizing the following text: "red sun, pink candy. Green flower."

Here is the output of get_feature_names():

['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']

Since "candy" and "green" belong to separate sentences, why is the "candy green" n-gram created?

Is there a way to prevent the creation of n-grams spanning multiple sentences?

Vivek Kumar · Accepted Answer

It depends on how you are passing the text to TfidfVectorizer!

If it is passed as a single document, TfidfVectorizer (with its default token_pattern) lowercases the text and keeps only words that contain 2 or more alphanumeric characters. Punctuation is completely ignored and always treated as a token separator. So your sentence becomes:

['red', 'sun', 'pink', 'candy', 'green', 'flower']

The n-grams are then generated from this flat list of tokens, which is why "candy" and "green" end up adjacent.
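To see this for yourself, you can call the analyzer that the vectorizer builds internally. The snippet below is a minimal sketch that reuses the parameters from the question; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same parameters as in the question.
vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer='word', ngram_range=(1, 2))

# build_analyzer() returns the callable that does preprocessing,
# tokenization and n-gram generation for a single document.
analyzer = vectorizer.build_analyzer()
ngrams = analyzer("red sun, pink candy. Green flower.")

# The sentence boundary is gone, so "candy green" shows up as a bigram.
print(ngrams)
```

The analyzer never sees the period as anything other than a separator, so it has no way of knowing that "candy" and "green" came from different sentences.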

TfidfVectorizer is a bag-of-words technique: it works on the words appearing in a document and keeps no information about the document's structure, such as where one sentence ends and the next begins. N-grams are therefore built from consecutive tokens, regardless of the punctuation that originally stood between them. If you want the sentences to be treated separately, you should detect the sentences yourself and pass them as different documents.
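A rough sketch of the "split it yourself" approach follows. The regex split on `.`, `!` and `?` is a naive assumption made for illustration; a real sentence tokenizer (for example from NLTK or spaCy) would handle abbreviations and similar cases better. The vocabulary is read from `vocabulary_` so the snippet does not depend on which feature-name method your scikit-learn version provides.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."

# Naive sentence splitter (an assumption for this sketch):
# break on runs of '.', '!' or '?' and drop empty pieces.
sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer='word', ngram_range=(1, 2))

# Each sentence is now its own document, so no n-gram can cross a boundary.
matrix = vectorizer.fit_transform(sentences)
features = sorted(vectorizer.vocabulary_)
print(features)
```

Keep in mind that this changes what a "document" is: the IDF weights are now computed over sentences, and you get one row per sentence instead of one row per original text.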

Alternatively, pass your own analyzer to TfidfVectorizer. A custom analyzer replaces the built-in tokenization, so it also has to generate the n-grams itself.
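One possible shape for such an analyzer is sketched below. The function name `sentence_bounded_ngrams`, its `max_n` parameter, and the regex-based sentence split are all hypothetical choices made for this example, not part of scikit-learn. The token regex mirrors the default "2 or more alphanumeric characters" rule described above.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokens of 2 or more word characters, as in the default behaviour.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def sentence_bounded_ngrams(doc, max_n=2):
    """Hypothetical analyzer: word n-grams that never cross a sentence end."""
    features = []
    # Naive sentence split on '.', '!' and '?' (an assumption of this sketch).
    for sentence in re.split(r"[.!?]+", doc.lower()):
        tokens = TOKEN.findall(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                features.append(" ".join(tokens[i:i + n]))
    return features

# With a callable analyzer, the callable produces all features itself,
# so ngram_range is not passed here.
vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer=sentence_bounded_ngrams)
matrix = vectorizer.fit_transform(["red sun, pink candy. Green flower."])
features = sorted(vectorizer.vocabulary_)
print(features)
```

Unlike splitting into separate documents, this keeps one row per original text, so the IDF statistics are still computed per document.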

For more information on how TfidfVectorizer actually works, see my other answer:

sklearn TfidfVectorizer : Generate Custom NGrams by not removing stopword in them

Tags: scikit-learn, tf-idf

leon
