I have the following pandas structure:
col1 col2 col3 text
1 1 0 meaningful text
5 9 7 trees
7 8 2 text
I'd like to vectorise it using a TF-IDF vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). However, how can I add this info, with labels, back to my original df? So the target would look like:
col1 col2 col3 meaningful text trees
1 1 0 1 1 0
5 9 7 0 0 1
7 8 2 0 1 0
UPDATE:
The proposed solution gets the concatenation wrong even after renaming the original columns: dropping columns with at least one NaN leaves only 7 rows, even though I use fillna(0) before starting to work with the data.
Note: DataFrame.append(), which appends the rows of another dataframe to the end of the caller and returns a new object, has been deprecated since pandas 1.4.0; use pd.concat() instead. Columns in the other dataframe that are not in the caller are added as new columns, and the new cells are populated with NaN. When stacking frames, passing ignore_index=True to pd.concat() tells pandas to discard each DataFrame's original index and build a fresh one starting at 0; without it, the result keeps the original index values. In general, use ignore_index=True when appending multiple DataFrames unless you have a specific reason to keep the original indices.
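A minimal sketch of that index behaviour, using two toy frames (not the OP's data):

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 5]})
b = pd.DataFrame({"col1": [7]})

# Without ignore_index, each frame keeps its own index labels,
# so the result's index contains a duplicate: 0, 1, 0.
kept = pd.concat([a, b])

# With ignore_index=True, pandas builds a fresh 0..n-1 index.
fresh = pd.concat([a, b], ignore_index=True)

print(list(kept.index))   # [0, 1, 0]
print(list(fresh.index))  # [0, 1, 2]
```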
I would like to add some information to the accepted answer.
Before concatenating the two DataFrames (i.e. the main DataFrame and the TF-IDF DataFrame), make sure their indices are aligned. For instance, you can use df.reset_index(drop=True, inplace=True) to reset a DataFrame's index.
Otherwise, the concatenated DataFrame will contain a lot of NaN rows. Judging by the comments, this is probably what the OP experienced.
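A small illustration of this failure mode, with hypothetical frames: if the main DataFrame carries a non-default index (e.g. left over from filtering), axis-1 concatenation aligns on index labels rather than row positions, and every mismatch becomes a NaN row.

```python
import pandas as pd

main = pd.DataFrame({"col1": [1, 5, 7]}, index=[10, 11, 12])  # non-default index
tfidf = pd.DataFrame({"trees": [0.0, 1.0, 0.0]})              # default 0..2 index

# Misaligned indices: the union of labels gives 6 rows, half-filled with NaN.
bad = pd.concat([main, tfidf], axis=1)

# Resetting the index first restores row-by-row alignment.
main.reset_index(drop=True, inplace=True)
good = pd.concat([main, tfidf], axis=1)

print(len(bad), len(good))  # 6 3
```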
You can proceed as follows:
Load data into a dataframe:
import pandas as pd
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)
Output:
col1 col2 col3 text
0 1 1 0 meaningful text
1 5 9 7 trees
2 7 8 2 text
Tokenize the text column using sklearn.feature_extraction.text.TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
Convert the tokenized data into a dataframe:
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(df1)
Output:
meaningful text trees
0 0.795961 0.605349 0.0
1 0.000000 0.000000 1.0
2 0.000000 1.000000 0.0
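These weights can be reproduced by hand with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2-normalising each row. A sketch for the first document, "meaningful text" (term counts and document frequencies taken from the three rows above):

```python
import numpy as np

n = 3                                # number of documents
df_counts = {"meaningful": 1, "text": 2, "trees": 1}

# smoothed idf, as used by TfidfVectorizer(smooth_idf=True)
idf = {t: np.log((1 + n) / (1 + d)) + 1 for t, d in df_counts.items()}

# document 0 contains "meaningful" and "text" once each (tf = 1), no "trees"
raw = np.array([idf["meaningful"], idf["text"], 0.0])
row0 = raw / np.linalg.norm(raw)     # l2 normalisation (norm='l2')

print(row0.round(6))  # matches the first row above: 0.795961, 0.605349, 0.0
```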
Concatenate the tokenization dataframe to the original one:
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 text meaningful text trees
0 1 1 0 meaningful text 0.795961 0.605349 0.0
1 5 9 7 trees 0.000000 0.000000 1.0
2 7 8 2 text 0.000000 1.000000 0.0
If you want to drop the text column, you need to do that before the concatenation:
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 meaningful text trees
0 1 1 0 0.795961 0.605349 0.0
1 5 9 7 0.000000 0.000000 1.0
2 7 8 2 0.000000 1.000000 0.0
Here's the full code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
You can try the following:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# create some data
col1 = np.asarray(np.random.choice(10,size=(10)))
col2 = np.asarray(np.random.choice(10,size=(10)))
col3 = np.asarray(np.random.choice(10,size=(10)))
text = ['Some models allow for specialized',
'efficient parameter search strategies,',
'outlined below. Two generic approaches',
'to sampling search candidates are ',
'provided in scikit-learn: for given values,',
'GridSearchCV exhaustively considers all',
'parameter combinations, while RandomizedSearchCV',
'can sample a given number of candidates',
' from a parameter space with a specified distribution.',
' After describing these tools we detail best practice applicable to both approaches.']
# create a dataframe from the created data
df = pd.DataFrame([col1,col2,col3,text]).T
# set column names
df.columns=['col1','col2','col3','text']
tfidf_vec = TfidfVectorizer()
tfidf_dense = tfidf_vec.fit_transform(df['text']).toarray()
new_cols = tfidf_vec.get_feature_names_out()
# drop the text column first: 'text' may also appear as a tf-idf feature name
# and would clash with the original column on join
df = df.drop('text', axis=1)
# join the tf-idf values to the existing dataframe
df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))