
Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from a column of my DataFrame, which is called "merged":

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))

(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>

But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and consists of floats.

To clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.

Currently, I have this data in a DataFrame column, merged["other_data"], where each row is a comma-separated string. Below is an example row from merged["other_data"]:

0.4329597715,0.3637511039,0.4893141843,0.35840...   

How can I append the 57,629 rows of my DataFrame column to the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers or guidance.

jrjames83 asked Nov 13 '16 03:11


2 Answers

This should do the job.

df1 = pd.DataFrame(X.toarray())  # convert the sparse matrix to a dense array
df2 = YOUR_DF  # your DataFrame of size 57k x 40

newDf = pd.concat([df1, df2], axis=1)  # newDf is the required DataFrame
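For anyone trying this dense approach, here is a minimal sketch of how the placeholder 57k x 40 frame could be built from the comma-separated strings in the question. The toy `X` and `merged` below, and the `extra_` column names, are assumptions made up for illustration; only the `other_data` column name comes from the question. Keep in mind that `X.toarray()` on the full 57,629 x 11,947 matrix would need roughly 5.5 GB as float64, so this may only be practical on smaller data.

```python
import pandas as pd
from scipy import sparse

# Toy stand-ins (assumptions): 3 documents, 4 TF-IDF columns, 2 extra features
X = sparse.csr_matrix([[0.0, 0.5, 0.0, 0.1],
                       [0.3, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.9, 0.0]])
merged = pd.DataFrame({"other_data": ["0.43,0.36", "0.48,0.35", "0.10,0.20"]})

# Split each comma-separated string into its own columns and cast to float
df2 = merged["other_data"].str.split(",", expand=True).astype(float)
df2.columns = ["extra_%d" % i for i in range(df2.shape[1])]  # hypothetical names

# Densify the sparse matrix; reset the index so both frames align row by row
df1 = pd.DataFrame(X.toarray())
newDf = pd.concat([df1, df2.reset_index(drop=True)], axis=1)
print(newDf.shape)
```

The `reset_index(drop=True)` is there because `pd.concat(..., axis=1)` aligns on the index, so a filtered or re-ordered `merged` could otherwise produce misaligned rows padded with NaN.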
eshb answered Sep 30 '22 17:09


I figured it out:

First: iterate over my pandas column and create a list of lists

for_np = []

for x in merged['other_data']:
    row = x.split(",")
    row2 = [float(v) for v in row]  # build a real list; map() is lazy on Python 3
    for_np.append(row2)

Then create a NumPy array:

import numpy as np
n = np.array(for_np)

Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!

import scipy.sparse

X = scipy.sparse.hstack([X, n])
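Putting the steps together, a self-contained sketch on toy data might look like the following. The three sample rows, the default `TfidfVectorizer()` settings, and the `extra_weight` variable are all assumptions for illustration (the question's `min_df=50` would discard every term in a corpus this small). The weight is just one possible way to do the reweighting mentioned above; 1.0 leaves the extra features unchanged.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the `merged` DataFrame (assumption: same two column names)
merged = pd.DataFrame({
    "kws_name_desc": ["red cotton shirt", "blue cotton jeans", "red wool hat"],
    "other_data": ["0.43,0.36", "0.48,0.35", "0.10,0.20"],
})

vect = TfidfVectorizer()
X = vect.fit_transform(merged["kws_name_desc"])

# Parse the comma-separated strings into a dense float array
n = np.array([[float(v) for v in s.split(",")] for s in merged["other_data"]])

# Hypothetical weight for the extra block; 1.0 leaves it unchanged
extra_weight = 1.0

# hstack returns a COO matrix by default; format="csr" yields CSR instead
X_combined = sparse.hstack([X, sparse.csr_matrix(n * extra_weight)], format="csr")
print(X_combined.shape)
```

Requesting CSR output is optional, but it keeps row slicing available afterwards, which COO matrices do not support.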
jrjames83 answered Sep 30 '22 17:09