
Insert result of sklearn CountVectorizer in a pandas dataframe

I have 14784 text documents that I am trying to vectorize so I can run some analysis. I used sklearn's CountVectorizer to convert the documents to feature vectors, by calling:

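from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the class; the bare name CountVectorizer would not work here
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)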

where examples is an array of all the text documents.

Now I am trying to use additional features, which I store in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature matrix is (14784, 21343).

What would be a good way to insert the vectorized features into the pandas dataframe?

Saurabh Sood asked Nov 02 '16 00:11



1 Answer

Learn the vocabulary from the raw documents and return the document-term matrix (one row per document, one column per term).

X = vect.fit_transform(docs) 

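Convert the sparse CSR matrix to a dense array and label the columns with the feature names, so that each column index maps to its term. In scikit-learn 1.0 and later the names come from get_feature_names_out(); older releases use get_feature_names() instead. Passing index=df.index keeps the rows aligned with the original dataframe.

count_vect_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out(), index=df.index)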

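Concatenate the original df and count_vect_df column-wise. pd.concat aligns rows on the index, so both frames must share the same index, and the result has to be assigned to keep it.

combined = pd.concat([df, count_vect_df], axis=1)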
Nickil Maveli answered Oct 12 '22 03:10