
Append tfidf to pandas dataframe

I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). How can I add this information, with labels, to my original df? The target would look like:

col1 col2 col3 meaningful text trees
1    1    0    1          1    0
5    9    7    0          0    1
7    8    2    0          1    0

UPDATE:

The solution produces a wrong concatenation even after renaming the original columns. Dropping NaN values leaves only 7 rows, even though I call fillna(0) before working with the data.

lte__ asked Aug 30 '17 13:08




3 Answers

I would like to add some information to the accepted answer.

Before concatenating the two DataFrames (i.e. the main DataFrame and the TF-IDF DataFrame), make sure that their indices match. For instance, you can use df.reset_index(drop=True, inplace=True) to reset the DataFrame index.

Otherwise, the concatenated DataFrame will contain a lot of NaN rows. Judging by the comments, this is probably what the OP experienced.
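
A minimal sketch of the index problem, using made-up data (the index labels 10, 20, 30 and the single trees column are assumptions chosen for illustration, not the OP's actual data):

```python
import pandas as pd

# Hypothetical main frame with a non-default index, e.g. left over from filtering rows
df = pd.DataFrame({"col1": [1, 5, 7]}, index=[10, 20, 30])
# A frame built from a plain array gets a fresh 0..n-1 index
tfidf = pd.DataFrame({"trees": [0.0, 1.0, 0.0]})

# concat(axis=1) aligns on index labels; none match, so the result is padded with NaN
misaligned = pd.concat([df, tfidf], axis=1)
print(misaligned.shape)  # (6, 2)

# Resetting the index first makes the labels line up row by row
aligned = pd.concat([df.reset_index(drop=True), tfidf], axis=1)
print(aligned.shape)  # (3, 2)
```

If the original index labels need to be kept, building the second frame with index=df.index should have the same effect without resetting anything.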

Glorian answered Oct 19 '22 07:10


You can proceed as follows:

Load data into a dataframe:

import pandas as pd

df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)

Output:

   col1  col2  col3             text
0     1     1     0  meaningful text
1     5     9     7            trees
2     7     8     2             text

Vectorize the text column using sklearn.feature_extraction.text.TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['text'])

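Convert the TF-IDF matrix into a dataframe:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())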
print(df1)

Output:

   meaningful      text  trees
0    0.795961  0.605349    0.0
1    0.000000  0.000000    1.0
2    0.000000  1.000000    0.0

Concatenate the TF-IDF dataframe to the original one:

res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3             text  meaningful      text  trees
0     1     1     0  meaningful text    0.795961  0.605349    0.0
1     5     9     7            trees    0.000000  0.000000    1.0
2     7     8     2             text    0.000000  1.000000    0.0

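If you want to drop the column text, you need to do that before the concatenation, because afterwards the result contains two columns named text and dropping by that label would remove both: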

df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3  meaningful      text  trees
0     1     1     0    0.795961  0.605349    0.0
1     5     9     7    0.000000  0.000000    1.0
2     7     8     2    0.000000  1.000000    0.0

Here's the full code:

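import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the whitespace-separated file
df = pd.read_table("/tmp/test.csv", sep=r"\s+")

# Fit TF-IDF on the text column; x is a sparse matrix
v = TfidfVectorizer()
x = v.fit_transform(df['text'])

# One column per term (get_feature_names_out() needs scikit-learn >= 1.0)
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
# Drop the raw text first so the 'text' term column does not collide with it
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)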

Mohamed Ali JAMAOUI answered Oct 19 '22 07:10


You can try the following:

import numpy as np 
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer

# create some data
col1 = np.random.choice(10, size=10)
col2 = np.random.choice(10, size=10)
col3 = np.random.choice(10, size=10)
text = ['Some models allow for specialized',
         'efficient parameter search strategies,',
         'outlined below. Two generic approaches',
         'to sampling search candidates are ',
         'provided in scikit-learn: for given values,',
         'GridSearchCV exhaustively considers all',
         'parameter combinations, while RandomizedSearchCV',
         'can sample a given number of candidates',
         ' from a parameter space with a specified distribution.',
         ' After describing these tools we detail best practice applicable to both approaches.']

# create a dataframe from the created data
df = pd.DataFrame([col1,col2,col3,text]).T
# set column names
df.columns=['col1','col2','col3','text']

tfidf_vec = TfidfVectorizer()
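# toarray() returns a plain ndarray (todense() returns the legacy np.matrix type)
tfidf_dense = tfidf_vec.fit_transform(df['text']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
new_cols = tfidf_vec.get_feature_names_out()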

# drop the text column first: if 'text' is also one of the terms, join raises a ValueError for overlapping column names
df = df.drop('text',axis=1)
# join the tfidf values to the existing dataframe
df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))

Clock Slave answered Oct 19 '22 06:10