Pandas: Concatenating DataFrame with Sparse Matrix

I'm doing some basic machine learning and have a sparse matrix resulting from TFIDF as follows:

<983x33599 sparse matrix of type '<type 'numpy.float64'>'
    with 232944 stored elements in Compressed Sparse Row format>

Then I have a DataFrame with a title column. I want to combine these into one DataFrame, but when I try to use concat, I get an error saying that I can't combine a DataFrame with a non-DataFrame object.

How do I get around this?

Thanks!

anon_swe asked Jun 28 '17 19:06


People also ask

What is the difference between merging and concatenation in pandas?

merge() combines data on common columns or indices; join() combines data on a key column or an index; concat() combines DataFrames across rows or columns.
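To make the difference concrete, here is a minimal sketch. The frames `left` and `right` and their column names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy frames, used only to contrast the three methods
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# merge(): SQL-style join on a shared column (inner join by default)
merged = left.merge(right, on="key")

# join(): aligns on the other frame's index (left join by default)
joined = left.set_index("key").join(right.set_index("key"))

# concat(): stacks the frames along an axis (rows here)
stacked = pd.concat([left, right], axis=0, ignore_index=True)

print(merged)
print(joined)
print(stacked)
```

Only the keys present in both frames survive the default merge, the join keeps every key of the left frame, and concat simply appends rows and fills the columns a frame lacks with NaN.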

How do you concatenate in pandas?

The concat() function does all the heavy lifting of concatenating pandas objects along an axis, with optional set logic (union or intersection) of the indexes (if any) on the other axes. Parameters: objs: Series or DataFrame objects; axis: the axis to concatenate along (default 0).
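A short sketch of the axis and set-logic behaviour described above. The frames `a` and `b` are invented for this example, with deliberately overlapping indexes.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({"y": [10, 20, 30]}, index=[1, 2, 3])

# axis=0 (default): rows are appended, columns are unioned
rows = pd.concat([a, b])

# axis=1 with the default join="outer": union of the row indexes
outer = pd.concat([a, b], axis=1)

# axis=1 with join="inner": intersection of the row indexes
inner = pd.concat([a, b], axis=1, join="inner")

print(rows.shape, outer.shape, inner.shape)
```

With `join="inner"` only the index labels shared by both frames (1 and 2 here) are kept.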

How do you create a sparse DataFrame in Python?

Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
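A minimal sketch of that call, assuming pandas 0.25 or later and SciPy are installed. The small matrix and the column names `a`, `b`, `c` are placeholders, not data from the question.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 3x3 matrix with two non-zero entries
mat = sparse.csr_matrix(np.array([[0.0, 0.0, 1.5],
                                  [0.0, 2.0, 0.0],
                                  [0.0, 0.0, 0.0]]))

# Each column becomes a pandas Series with a sparse dtype
sdf = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])

print(sdf.dtypes)          # per-column dtypes
print(sdf.sparse.density)  # fraction of values that are actually stored
```

The `.sparse` accessor also offers `to_coo()` and `to_dense()` if you need to go back to a SciPy matrix or a regular DataFrame.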


1 Answer

Consider the following demo:

Source DF:

In [2]: df
Out[2]:
                     text
0       is it  good movie
1  wooow is it very goode
2               bad movie

Solution: let's create a SparseDataFrame out of the TF-IDF sparse matrix:

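import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')

# Note: pd.SparseDataFrame was removed in pandas 1.0 and get_feature_names()
# was removed in scikit-learn 1.2, so this snippet needs older versions.
sdf = pd.SparseDataFrame(vect.fit_transform(df['text']),
                         columns=vect.get_feature_names(),
                         default_fill_value=0)
# attach the original text column to the sparse frame
sdf['text'] = df['text']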

Result:

In [13]: sdf
Out[13]:
   bad  good     goode     wooow                    text
0  0.0   1.0  0.000000  0.000000       is it  good movie
1  0.0   0.0  0.707107  0.707107  wooow is it very goode
2  1.0   0.0  0.000000  0.000000               bad movie

In [14]: sdf.memory_usage()
Out[14]:
Index    80
bad       8
good      8
goode     8
wooow     8
text     24
dtype: int64

P.S. Pay attention to .memory_usage(): we didn't lose the "sparseness". If we used pd.concat, join, merge, etc., we would lose it, because all these methods generate a new regular (non-sparse) copy of the merged DataFrames.

MaxU - stop WAR against UA answered Sep 28 '22 03:09