Insert result of sklearn CountVectorizer in a pandas dataframe

Tags: python, pandas, machine-learning, scikit-learn

I have 14784 text documents that I am trying to vectorize so I can run some analysis. I used sklearn's CountVectorizer to convert the documents to feature vectors by calling:

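from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  # instantiate the class before fitting
features = vectorizer.fit_transform(examples)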

where examples is an array of all the text documents.

Now I am trying to use additional features, which I am storing in a pandas DataFrame. At present, my DataFrame (without the text features) has the shape (14784, 5). The shape of my feature matrix is (14784, 21343).

What would be a good way to insert the vectorized features into the pandas dataframe?

asked Nov 02 '16 00:11

Saurabh Sood

1 Answer

Learn the vocabulary dictionary from the raw documents and return the document-term matrix (one row per document, one column per vocabulary term).

X = vect.fit_transform(docs)

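Convert the sparse CSR matrix to a dense array and use the feature names (the mapping from column index to vocabulary term) as the column labels. Note that get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on version 1.0 or later. Passing index=df.index keeps the rows aligned with the original frame.

count_vect_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out(), index=df.index)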

Concatenate the original df and count_vect_df column-wise. pd.concat aligns rows on the index, so both frames must share the same index.

pd.concat([df, count_vect_df], axis=1)

answered Oct 12 '22 03:10

Nickil Maveli
