How to make TfidfVectorizer only learn alphabetical characters as part of the vocabulary (exclude numbers)

I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using scikit-learn's TfidfVectorizer. This is my current code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np

    max_df_param = .003
    use_idf = True

    # Unigrams
    vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
    X = vectorizer.fit_transform(dataframe[column])
    unigrams = vectorizer.get_feature_names()

    # Bigrams
    vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(2, 2), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
    X = vectorizer.fit_transform(dataframe[column])
    bigrams = vectorizer.get_feature_names()

    # Trigrams
    vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(3, 3), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
    X = vectorizer.fit_transform(dataframe[column])
    trigrams = vectorizer.get_feature_names()

    vocab = np.concatenate((unigrams, bigrams, trigrams))

However, I would like to exclude numbers and words that contain numbers; the current output contains terms such as "0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b".

I tried to restrict the vocabulary to purely alphabetical words using the token_pattern parameter with the following regex:

    vectorizer = TfidfVectorizer(max_df=max_df_param,
                                 token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
                                 stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)

but this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words

I have also tried removing only the numbers, but I still get the same error.

Is my regex incorrect, or am I using TfidfVectorizer incorrectly? (I have also tried removing the max_features argument.)

Thank you!

Asked Aug 01 '18 by Matt

People also ask

What is the difference between TfidfVectorizer and TfidfTransformer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequency weighting for you, while TfidfTransformer requires you to first produce the term counts with scikit-learn's CountVectorizer.
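As a rough sketch of that equivalence (toy corpus assumed, not from the question):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
    import numpy as np

    docs = ["the cat sat", "the dog sat", "the cat ran"]

    # One step: TfidfVectorizer counts terms and applies the idf weighting itself.
    direct = TfidfVectorizer().fit_transform(docs)

    # Two steps: CountVectorizer produces the counts, TfidfTransformer weights them.
    counts = CountVectorizer().fit_transform(docs)
    two_step = TfidfTransformer().fit_transform(counts)

    print(np.allclose(direct.toarray(), two_step.toarray()))  # True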

What is the difference between CountVectorizer and TfidfVectorizer?

TF-IDF improves on raw counts because it not only reflects how frequently words occur in the corpus but also how informative each word is. Words that carry little information can then be dropped, reducing the input dimensionality and keeping the model simpler.
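A small illustration (toy corpus assumed): with raw counts, every word in the first document scores the same, while tf-idf gives "the", which occurs in every document, a visibly lower weight than the rarer words:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the cat sat", "the dog ran", "the bird flew"]

    counts = CountVectorizer().fit_transform(docs)
    print(counts.toarray()[0])  # every word in doc 0 has the same count: 1

    weights = TfidfVectorizer().fit_transform(docs)
    print(weights.toarray()[0].round(2))  # 'the' is down-weighted relative to 'cat' and 'sat'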

Does TfidfVectorizer remove stop words?

Only if you ask it to, via the stop_words parameter. That said, given how the tf-idf score is constructed, removing stop words shouldn't make a significant difference: the whole point of the idf term is to down-weight words with no semantic value across the corpus. If you leave the stop words in, the idf weighting should largely neutralize them.
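A sketch of both behaviours (toy corpus assumed):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog ran", "the bird flew"]

    # By default nothing is removed; 'the' stays in the vocabulary
    # (but its idf weight is low because it appears in every document).
    default = TfidfVectorizer().fit(docs)
    print(default.get_feature_names())  # includes 'the'

    # With stop_words='english', stop words are dropped from the vocabulary outright.
    filtered = TfidfVectorizer(stop_words='english').fit(docs)
    print(filtered.get_feature_names())  # 'the' is gone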

Does TfidfVectorizer do Stemming?

Not by itself. You can, however, pass TfidfVectorizer your own function that performs custom tokenization and stemming, while still using scikit-learn's built-in stop word removal rather than NLTK's. A sketch of that approach follows.
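This sketch assumes NLTK is installed; the tokenizer function and corpus are illustrative, not from the question:

    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_tokenize(text):
        # Reproduce the vectorizer's default token pattern, then stem each token.
        return [stemmer.stem(tok) for tok in re.findall(r'(?u)\b\w\w+\b', text)]

    # scikit-learn may warn that the stop word list is inconsistent with the
    # stemmer's output; the vectorizer still works.
    vectorizer = TfidfVectorizer(tokenizer=stem_tokenize, stop_words='english')
    vectorizer.fit(["running runs ran", "the runner runs"])
    print(vectorizer.get_feature_names())  # ['ran', 'run', 'runner']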


1 Answer

That's because your regex is wrong.

1) You are using ^ and $, which anchor a match to the start and end of the string. That means the pattern only matches when the entire string consists of alphabetic characters (no numbers, no spaces, no other special characters), so for any real document nothing is tokenized at all. You don't want that, so remove the anchors.

See the details about special characters here: https://docs.python.org/3/library/re.html#regular-expression-syntax
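A quick standalone illustration (scikit-learn applies token_pattern with findall under the hood, so an anchored pattern finds nothing in any document that contains spaces):

    import re

    print(re.findall(r'^[A-Za-z]+$', 'two words'))    # [] -- the anchors demand the whole string match
    print(re.findall(r'\b[A-Za-z]+\b', 'two words'))  # ['two', 'words']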

2) You are writing the pattern in a plain (non-raw) string without escaping the backslashes, so Python's string parser consumes them before the regex engine ever sees the pattern: '\b' becomes the backspace character rather than a word-boundary assertion. Either double the backslashes or use the r prefix.
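You can see the difference directly:

    # In a plain string literal '\b' is parsed as the backspace character;
    # the r prefix leaves the backslash for the regex engine.
    print(repr('\b'))   # '\x08'
    print(repr(r'\b'))  # '\\b'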

3) The u prefix marks a unicode string. Unless your pattern contains special unicode characters, it is not needed either (in Python 3 all strings are unicode anyway). See more about that here: Python regex - r prefix

So finally your correct token_pattern should be:

token_pattern=r'(?u)\b[A-Za-z]+\b'
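Applied to the first vectorizer from the question, a sketch (dataframe, column, and the other settings are assumed from there):

    vectorizer = TfidfVectorizer(max_df=max_df_param,
                                 token_pattern=r'(?u)\b[A-Za-z]+\b',
                                 stop_words='english', ngram_range=(1, 1),
                                 max_features=2000, use_idf=use_idf)
    X = vectorizer.fit_transform(dataframe[column])
    unigrams = vectorizer.get_feature_names()  # purely alphabetic tokens only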
Answered Sep 27 '22 by Vivek Kumar