I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using scikit-learn's TfidfVectorizer. This is my current code:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

max_df_param = .003
use_idf = True

# Unigrams
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()

# Bigrams
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(2, 2), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()

# Trigrams
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(3, 3), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()

vocab = np.concatenate((unigrams, bigrams, trigrams))
However, I would like to exclude numbers and words that contain numbers; the current output contains terms such as "0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b".
I tried to restrict the vocabulary to purely alphabetic words using the token_pattern parameter with the following regex:
vectorizer = TfidfVectorizer(max_df=max_df_param,
                             token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
                             stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
but this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words
I have also tried removing only the numbers, but I still get the same error.
Is my regex incorrect, or am I using TfidfVectorizer incorrectly? (I have also tried removing the max_features argument.)
Thank you!
The main difference between the two implementations is that TfidfVectorizer computes both the term frequency and the inverse document frequency for you, while with TfidfTransformer you must first use scikit-learn's CountVectorizer class to compute the term frequencies.
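For concreteness, here is a minimal sketch of that equivalence; the three-document corpus is purely illustrative, not taken from the question:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus for illustration

# One step: TfidfVectorizer counts terms and applies IDF weighting itself.
tfidf_direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: CountVectorizer for raw term frequencies, then TfidfTransformer for the IDF part.
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With default parameters, both routes produce the same matrix.
print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))  # True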
TF-IDF improves on plain count vectorization because it not only reflects how frequently words appear in the corpus but also weights how important each word is. We can then drop the words that matter least for the analysis, which reduces the input dimensionality and keeps model building simpler.
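One hedged sketch of that pruning idea: rank terms by their learned IDF weight and keep only the most document-specific ones. The corpus and the cutoff here are illustrative assumptions, not part of the original question:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat", "a cat chased a mouse"]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

terms = np.array(vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
order = np.argsort(vectorizer.idf_)[::-1]         # higher IDF = rarer, more distinctive
top_terms = terms[order][:5]                      # keep an arbitrary top 5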
Given how the TF-IDF score is constructed, removing stop words shouldn't make a significant difference. The whole point of the IDF term is to down-weight words with no semantic value across the corpus: if you leave the stop words in, the IDF weighting should largely suppress them anyway.
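You can check this claim directly by fitting without a stop word list and comparing learned IDF weights; the corpus and the two probe words below are assumptions for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog ran", "the bird sang", "a cat slept"]
vectorizer = TfidfVectorizer()  # no stop_words argument: keep everything
vectorizer.fit(corpus)

vocab = vectorizer.vocabulary_
print(vectorizer.idf_[vocab['the']])  # appears in most documents -> low IDF weight
print(vectorizer.idf_[vocab['cat']])  # appears in fewer documents -> higher IDF weight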
In particular, we pass TfidfVectorizer our own function that performs custom tokenization and stemming, but we use scikit-learn's built-in stop word removal rather than NLTK's.
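A minimal sketch of that setup, with a hypothetical tokenize_and_stem helper (the function name, the Porter stemmer choice, and the sample documents are all assumptions):
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # Keep alphabetic tokens only, then reduce each token to its stem.
    tokens = re.findall(r'[a-z]+', text.lower())
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english')
X = vectorizer.fit_transform(["The cats are running", "A dog ran home"])
print(vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
Note that the stop word list is applied after tokenization, so stemmed forms that no longer match an entry in the list can slip through; recent scikit-learn versions emit a warning about this inconsistency.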
That's because your regex is wrong.
1) You are using ^ and $, which denote the start and end of the string. That means the pattern would only match a complete string consisting solely of letters (no numbers, no spaces, no other special characters), which is not what you want here. So remove them.
See the details about special characters here: https://docs.python.org/3/library/re.html#regular-expression-syntax
2) You are using the regex pattern in a plain (non-raw) string without escaping the backslashes, so Python itself interprets escape sequences like \b (as a backspace character) before the regex engine ever sees them, and the pattern no longer means what you intend. Either escape each backslash with a double backslash or use the r prefix.
3) The u prefix marks a unicode string. Unless your regex pattern contains special unicode characters, it isn't needed either.
See more about that here: Python regex - r prefix
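Points 2) and 3) are easy to verify in an interpreter; this is plain Python, nothing question-specific:
print(len('\b'))   # 1: Python collapsed \b into a single backspace control character
print(len(r'\b'))  # 2: the raw string keeps backslash + 'b', the regex word boundary
print(len(u'\b'))  # 1: the u prefix changes nothing here; it only marks a unicode literal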
So finally your correct token_pattern should be:
token_pattern=r'(?u)\b[A-Za-z]+\b'
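Putting it together with the rest of your original arguments (the two toy documents are just for demonstration):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the 1960s had 200 hits", "a 2d and 3d design era"]  # toy corpus
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b[A-Za-z]+\b', stop_words='english')
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())  # ['design', 'era', 'hits'] -- digit-bearing tokens are gone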