How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.
This is the code from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that transforms text into a usable numeric vector. It combines two concepts: term frequency (TF) and inverse document frequency (IDF). The term frequency is the number of occurrences of a specific term in a document.
The TF-IDF (term frequency-inverse document frequency) algorithm is based purely on word statistics for text feature extraction. It considers only the literal form of a word (e.g., its ASCII representation), which is the same across all texts, without accounting for the fact that the same concept could be expressed by synonyms.
TF-IDF improves on a plain count vectorizer because it not only captures the frequency of words in the corpus but also weights them by importance. We can then remove the words that are less important for analysis, making model building less complex by reducing the input dimensions.
TF-IDF for Bigrams & Trigrams
1. TF-IDF = (TF) × (IDF)
2. Bigrams: a bigram is 2 consecutive words in a sentence, e.g. "The boy is playing football" yields "The boy", "boy is", "is playing", "playing football".
3. Trigrams: a trigram is 3 consecutive words in a sentence.
Of the resulting bigrams and trigrams, some are relevant, while those that do not contribute value for further processing are discarded.
The TF-IDF vectorizer is a very popular text vectorization approach for traditional machine learning algorithms: it transforms text into vectors that those algorithms can consume.
In the TF-IDF approach, words that are common in one document but rare in the others are given more weight. Since every word is treated individually when converted to numeric form, context information is lost; n-grams help us retain that context.
TF is term frequency, and IDF is inverse document frequency. This method is often used for information retrieval and text mining. We will take four reviews as our document corpus and store them in a list, as in the code above.
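To make the formula concrete, here is a sketch of scikit-learn's default computation (smoothed IDF, i.e. idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization), verified against TfidfVectorizer on the four-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts (TF); the vocabulary is sorted alphabetically,
# matching TfidfVectorizer's column order
counts = CountVectorizer().fit_transform(corpus).toarray()

# Smoothed IDF used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = TF * IDF, then L2-normalize each row
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

X = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf, X))  # True
```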
TfidfVectorizer has an ngram_range parameter to determine the range of n-grams you want as new features in the final matrix. In your case, you want (1, 2) to go from unigrams to bigrams:
import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# toarray() gives a plain ndarray; get_feature_names() was removed in
# scikit-learn 1.2 in favour of get_feature_names_out()
X = vectorizer.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
and and this document document is first first document \
0 0.000000 0.000000 0.314532 0.000000 0.388510 0.388510
1 0.000000 0.000000 0.455513 0.356824 0.000000 0.000000
2 0.357007 0.357007 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.282940 0.000000 0.349487 0.349487
is is the is this one ... the the first \
0 0.257151 0.314532 0.000000 0.000000 ... 0.257151 0.388510
1 0.186206 0.227756 0.000000 0.000000 ... 0.186206 0.000000
2 0.186301 0.227873 0.000000 0.357007 ... 0.186301 0.000000
3 0.231322 0.000000 0.443279 0.000000 ... 0.231322 0.349487
...
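From here you can feed the matrix to any scikit-learn classifier. A minimal sketch, using hypothetical tweets and sentiment labels purely for illustration (your real tweets and labels go in their place), with a pipeline so the same vectorizer is reused at prediction time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical tweets and binary sentiment labels for illustration only
tweets = [
    'I love this phone, best purchase ever',
    'Absolutely terrible service, never again',
    'What a great day, feeling happy',
    'Worst experience of my life, so disappointed',
]
labels = [1, 0, 1, 0]

# Unigram + bigram TF-IDF features feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)

pred = clf.predict(['I love this, great experience'])
print(pred)
```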