data frame of tfidf with Python

Tags: python, pandas, dataframe, text-mining, tf-idf

I have to classify some sentiments. My data frame looks like this:

Phrase                    Sentiment
is it  good movie         positive
wooow is it very goode    positive
bad movie                 negative

I did some preprocessing (tokenisation, stop-word removal, stemming, etc.) and got:

Phrase                         Sentiment
[good, movie]                  positive
[wooow, is, it, very, good]    positive
[bad, movie]                   negative

Finally, I need a dataframe in which each row is a text, each column is a word, and each value is the tf-idf of that word in that text, like this:

good     movie    wooow    very     bad      Sentiment
tf_idf   tf_idf   tf_idf   tf_idf   tf_idf   positive
(same thing for the 2 remaining lines)


asked Jan 27 '17 22:01

Amal Kostali Targhi

1 Answer

I'd use sklearn.feature_extraction.text.TfidfVectorizer, which is specifically designed for such tasks:

Demo:

In [63]: df
Out[63]:
                   Phrase Sentiment
0       is it  good movie  positive
1  wooow is it very goode  positive
2               bad movie  negative

Solution:

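import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')

# fit on the raw phrases; pop() also removes the 'Phrase' column from df
X = vect.fit_transform(df.pop('Phrase')).toarray()

# keep the labels before the original frame is dropped
r = df[['Sentiment']].copy()

del df

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vect.get_feature_names_out())

del X
del vect

r.join(df)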

Result:

In [31]: r.join(df)
Out[31]:
  Sentiment  bad  good     goode     wooow
0  positive  0.0   1.0  0.000000  0.000000
1  positive  0.0   0.0  0.707107  0.707107
2  negative  1.0   0.0  0.000000  0.000000

UPDATE: memory saving solution:

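from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')

X = vect.fit_transform(df.pop('Phrase')).toarray()

# add one tf-idf column per feature to the existing frame instead of building a second one
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]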

UPDATE2: related question where the memory issue was finally solved


answered Oct 04 '22 02:10

MaxU - stop WAR against UA
