How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.
This is the code from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that transforms text into a usable numeric vector. It combines two concepts: term frequency (TF) and inverse document frequency (IDF). The term frequency is the number of occurrences of a specific term in a document.
The TF-IDF (term frequency-inverse document frequency) algorithm is based purely on word statistics for text feature extraction. It considers only the literal form of a word (e.g., its ASCII representation), which is the same across all texts, without accounting for the fact that the same concept could be expressed by synonyms.
TF-IDF improves on a plain count vectorizer because it not only captures the frequency of words in the corpus but also weights them by importance. We can then remove the words that are less important for analysis, making model building less complex by reducing the input dimensions.
TF-IDF for Bigrams & Trigrams
1. TF-IDF = (TF) × (IDF)
2. Bigrams: a bigram is 2 consecutive words in a sentence, e.g. "The boy is playing football" yields "The boy", "boy is", "is playing", "playing football".
3. Trigrams: a trigram is 3 consecutive words in a sentence.
Of the resulting bigrams and trigrams, some are relevant, while those that do not contribute value for further processing are discarded.
The TF-IDF vectorizer is a very popular text vectorization approach for traditional machine learning algorithms: it transforms text into vectors that those algorithms can consume.
In the TF-IDF approach, words that are common in one document but rare in the others are given more weight. Since every word is treated individually when converted to numeric form, context information is lost; n-grams help us retain that context.
TF is term frequency, and IDF is inverse document frequency. This method is often used for information retrieval and text mining. We will take four reviews as our document corpus and store them in a list, as in the code above.
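To make the formula concrete, here is a sketch of scikit-learn's default computation (smoothed IDF, i.e. idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization), verified against TfidfVectorizer on the four-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts (TF); the vocabulary is sorted alphabetically,
# matching TfidfVectorizer's column order
counts = CountVectorizer().fit_transform(corpus).toarray()

# Smoothed IDF used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = TF * IDF, then L2-normalize each row
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

X = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf, X))  # True
```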
TfidfVectorizer has an ngram_range parameter to determine the range of n-grams you want as new features in the final matrix. In your case, you want (1, 2) to go from unigrams to bigrams:
import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# toarray() gives a plain ndarray; get_feature_names() was removed in
# scikit-learn 1.2 in favour of get_feature_names_out()
X = vectorizer.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
and and this document document is first first document \
0 0.000000 0.000000 0.314532 0.000000 0.388510 0.388510
1 0.000000 0.000000 0.455513 0.356824 0.000000 0.000000
2 0.357007 0.357007 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.282940 0.000000 0.349487 0.349487
is is the is this one ... the the first \
0 0.257151 0.314532 0.000000 0.000000 ... 0.257151 0.388510
1 0.186206 0.227756 0.000000 0.000000 ... 0.186206 0.000000
2 0.186301 0.227873 0.000000 0.357007 ... 0.186301 0.000000
3 0.231322 0.000000 0.443279 0.000000 ... 0.231322 0.349487
...
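From here you can feed the matrix to any scikit-learn classifier. A minimal sketch, using hypothetical tweets and sentiment labels purely for illustration (your real tweets and labels go in their place), with a pipeline so the same vectorizer is reused at prediction time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical tweets and binary sentiment labels for illustration only
tweets = [
    'I love this phone, best purchase ever',
    'Absolutely terrible service, never again',
    'What a great day, feeling happy',
    'Worst experience of my life, so disappointed',
]
labels = [1, 0, 1, 0]

# Unigram + bigram TF-IDF features feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)

pred = clf.predict(['I love this, great experience'])
print(pred)
```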