
How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?

Is there a way to preserve the punctuation marks !, ?, " and ' in my text documents using the parameters of CountVectorizer or TfidfVectorizer in scikit-learn?

Suhairi Suhaimin asked Aug 31 '16 15:08



People also ask

What is the difference between CountVectorizer and TfidfVectorizer?

With TfidfTransformer you first compute word counts using CountVectorizer, then compute the inverse document frequency (IDF) values, and only then compute the tf-idf scores. TfidfVectorizer, by contrast, performs all three steps at once.

Does CountVectorizer remove punctuation?

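The default tokenization in CountVectorizer drops punctuation and special characters and ignores single-character tokens, because the default token_pattern, r"(?u)\b\w\w+\b", only matches runs of two or more word characters. If this is not the behavior you want and you need to keep punctuation and special characters, you can provide a custom tokenizer (or a custom token_pattern) to CountVectorizer.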
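A custom tokenizer of the kind mentioned above could look like the following. This is only a sketch: the function name keep_punct_tokenizer and its regular expression are hypothetical and chosen for illustration. It keeps contractions such as "don't" together and emits every other punctuation mark as its own token.

```python
import re

# Hypothetical pattern: a word (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def keep_punct_tokenizer(text):
    """Split text into words and individual punctuation marks."""
    return _TOKEN_RE.findall(text)

tokens = keep_punct_tokenizer("Don't stop! Really?")
print(tokens)
```

A callable like this can be passed through the tokenizer parameter, for example CountVectorizer(tokenizer=keep_punct_tokenizer). The function itself does not change case; lowercasing is handled separately by the vectorizer's lowercase option.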

What is CountVectorizer in Sklearn?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

What is the difference between TfidfVectorizer and TfidfTransformer?

The main difference between the 2 implementations is that TfidfVectorizer performs both term frequency and inverse document frequency for you, while using TfidfTransformer will require you to use the CountVectorizer class from Scikit-Learn to perform Term Frequency.


1 Answer

You should customize the token_pattern parameter when you instantiate the vectorizer. For example:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
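To see which tokens this pattern produces without fitting a vectorizer, it can be tried directly with the re module. This is a rough sketch that assumes the vectorizer lowercases the text and then collects every match of token_pattern, which is how the default scikit-learn tokenizer is understood to behave. The tokenize helper and the sample sentence are made up for illustration.

```python
import re

# Pattern from the answer: words of two or more characters,
# or one of the marks ! ? " '
TOKEN_PATTERN = r"(?u)\b\w\w+\b|!|\?|\"|\'"

def tokenize(text):
    """Lowercase the text, then return every match of the pattern in order."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize('Wait, what?! She said "no" and didn\'t stay.')
print(tokens)
# ['wait', 'what', '?', '!', 'she', 'said', '"', 'no', '"', 'and', 'didn', "'", 'stay']
```

Note that the comma and the period are still dropped, and that "didn't" is split into "didn" and "'" while the trailing single-letter "t" is discarded, since the word branch of the pattern requires at least two characters.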
elyase answered Sep 19 '22 00:09