I have the following pandas structure:
col1 col2 col3 text
1 1 0 meaningful text
5 9 7 trees
7 8 2 text
I'd like to vectorise it using a TF-IDF vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). However, how can I add this info, with labels, back to my original df? So the target would look like:
col1 col2 col3 meaningful text trees
1 1 0 1 1 0
5 9 7 0 0 1
7 8 2 0 1 0
UPDATE:
The proposed solution gets the concatenation wrong even after renaming the original columns: dropping columns with at least one NaN leaves only 7 rows, even though I use fillna(0) before starting to work with the data.
Note: DataFrame.append(), which appends the rows of another dataframe to the end of the caller and returns a new object, has been deprecated since pandas 1.4.0; use pd.concat() instead. Columns in the other dataframe that are not in the caller are added as new columns, and the new cells are populated with NaN. When stacking frames, passing ignore_index=True to pd.concat() tells pandas to discard each DataFrame's original index and build a fresh one starting at 0; without it, the result keeps the original index values. In general, use ignore_index=True when appending multiple DataFrames unless you have a specific reason to keep the original indices.
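A minimal sketch of that index behaviour, using two toy frames (not the OP's data):

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 5]})
b = pd.DataFrame({"col1": [7]})

# Without ignore_index, each frame keeps its own index labels,
# so the result's index contains a duplicate: 0, 1, 0.
kept = pd.concat([a, b])

# With ignore_index=True, pandas builds a fresh 0..n-1 index.
fresh = pd.concat([a, b], ignore_index=True)

print(list(kept.index))   # [0, 1, 0]
print(list(fresh.index))  # [0, 1, 2]
```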
I would like to add some information to the accepted answer.
Before concatenating the two DataFrames (i.e. the main DataFrame and the TF-IDF DataFrame), make sure their indices are aligned. For instance, you can use df.reset_index(drop=True, inplace=True) to reset a DataFrame's index.
Otherwise, the concatenated DataFrame will contain a lot of NaN rows. Judging by the comments, this is probably what the OP experienced.
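A small illustration of this failure mode, with hypothetical frames: if the main DataFrame carries a non-default index (e.g. left over from filtering), axis-1 concatenation aligns on index labels rather than row positions, and every mismatch becomes a NaN row.

```python
import pandas as pd

main = pd.DataFrame({"col1": [1, 5, 7]}, index=[10, 11, 12])  # non-default index
tfidf = pd.DataFrame({"trees": [0.0, 1.0, 0.0]})              # default 0..2 index

# Misaligned indices: the union of labels gives 6 rows, half-filled with NaN.
bad = pd.concat([main, tfidf], axis=1)

# Resetting the index first restores row-by-row alignment.
main.reset_index(drop=True, inplace=True)
good = pd.concat([main, tfidf], axis=1)

print(len(bad), len(good))  # 6 3
```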
You can proceed as follows:
Load data into a dataframe:
import pandas as pd
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)
Output:
col1 col2 col3 text
0 1 1 0 meaningful text
1 5 9 7 trees
2 7 8 2 text
Tokenize the text column using sklearn.feature_extraction.text.TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
Convert the tokenized data into a dataframe:
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(df1)
Output:
meaningful text trees
0 0.795961 0.605349 0.0
1 0.000000 0.000000 1.0
2 0.000000 1.000000 0.0
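These weights can be reproduced by hand with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2-normalising each row. A sketch for the first document, "meaningful text" (term counts and document frequencies taken from the three rows above):

```python
import numpy as np

n = 3                                # number of documents
df_counts = {"meaningful": 1, "text": 2, "trees": 1}

# smoothed idf, as used by TfidfVectorizer(smooth_idf=True)
idf = {t: np.log((1 + n) / (1 + d)) + 1 for t, d in df_counts.items()}

# document 0 contains "meaningful" and "text" once each (tf = 1), no "trees"
raw = np.array([idf["meaningful"], idf["text"], 0.0])
row0 = raw / np.linalg.norm(raw)     # l2 normalisation (norm='l2')

print(row0.round(6))  # matches the first row above: 0.795961, 0.605349, 0.0
```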
Concatenate the tokenization dataframe to the original one:
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 text meaningful text trees
0 1 1 0 meaningful text 0.795961 0.605349 0.0
1 5 9 7 trees 0.000000 0.000000 1.0
2 7 8 2 text 0.000000 1.000000 0.0
If you want to drop the text column, you need to do that before the concatenation:
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 meaningful text trees
0 1 1 0 0.795961 0.605349 0.0
1 5 9 7 0.000000 0.000000 1.0
2 7 8 2 0.000000 1.000000 0.0
Here's the full code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
You can try the following:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# create some data
col1 = np.asarray(np.random.choice(10,size=(10)))
col2 = np.asarray(np.random.choice(10,size=(10)))
col3 = np.asarray(np.random.choice(10,size=(10)))
text = ['Some models allow for specialized',
'efficient parameter search strategies,',
'outlined below. Two generic approaches',
'to sampling search candidates are ',
'provided in scikit-learn: for given values,',
'GridSearchCV exhaustively considers all',
'parameter combinations, while RandomizedSearchCV',
'can sample a given number of candidates',
' from a parameter space with a specified distribution.',
' After describing these tools we detail best practice applicable to both approaches.']
# create a dataframe from the created data
df = pd.DataFrame([col1,col2,col3,text]).T
# set column names
df.columns=['col1','col2','col3','text']
tfidf_vec = TfidfVectorizer()
tfidf_dense = tfidf_vec.fit_transform(df['text']).toarray()
new_cols = tfidf_vec.get_feature_names_out()
# drop the text column first: 'text' may also appear as a tf-idf feature name
# and would clash with the original column on join
df = df.drop('text', axis=1)
# join the tf-idf values to the existing dataframe
df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))