I used sklearn.preprocessing.OneHotEncoder to transform some data, and the output is a scipy.sparse.csr.csr_matrix.
How can I merge it back into my original DataFrame along with the other columns?
I tried to use pd.concat but I get 
TypeError: cannot concatenate a non-NDFrame object
Thanks
The DataFrame.sparse.from_spmatrix() function creates a new DataFrame from a SciPy sparse matrix. The data argument must be convertible to CSC format; the optional index and columns arguments give the row and column labels to use for the resulting DataFrame.
You can use either todense() or toarray() to convert a CSR matrix to a dense matrix.
If A is a csr_matrix, you can use .toarray() (there is also .todense(), which produces a numpy matrix and also works with the DataFrame constructor):
df = pd.DataFrame(A.toarray())
You can then use this with pd.concat().
import pandas as pd
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
print(A)
Output:
  (0, 0)    1
  (0, 2)    2
  (1, 1)    3
print(type(A))
Output:
<class 'scipy.sparse.csr.csr_matrix'>
df = pd.DataFrame(A.todense())
df
Output:
   0  1  2
0  1  0  2
1  0  3  0
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null int64
1    2 non-null int64
2    2 non-null int64
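To connect this back to the question, here is a minimal sketch of merging the dense result into an existing DataFrame with pd.concat. The frame df, its column names, and the matrix A are made-up stand-ins for the original data and the encoder output, not taken from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical original frame; "color" stands in for the column that was encoded.
df = pd.DataFrame({"price": [10.0, 20.0], "color": ["red", "blue"]})

# Stand-in for the encoder output: one row per row of df.
A = csr_matrix([[0, 1], [1, 0]])

# Wrap the dense array in a DataFrame that reuses df's index so the rows align.
encoded = pd.DataFrame(
    A.toarray(), index=df.index, columns=["color_blue", "color_red"]
)

# axis=1 places the new columns next to the remaining original ones.
merged = pd.concat([df.drop(columns="color"), encoded], axis=1)
print(merged)
```

Reusing df.index matters when the original frame does not have a default 0..n-1 index; otherwise pd.concat aligns on mismatched labels and fills the gaps with NaN.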
In version 0.20, pandas introduced sparse data structures, including the SparseDataFrame.
In pandas 1.0, SparseDataFrame was removed:
In older versions of pandas, the SparseSeries and SparseDataFrame classes were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.
The migration guide shows how to use these new data structures.
For instance, to create a DataFrame from a sparse matrix:
import pandas as pd
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=['A', 'B', 'C'])
df
   A  B  C
0  1  0  2
1  0  3  0
df.dtypes
A    Sparse[int64, 0]
B    Sparse[int64, 0]
C    Sparse[int64, 0]
dtype: object
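The sparse DataFrame can then be concatenated with the original one without densifying it. The sketch below assumes a hypothetical frame df with a non-default index and a made-up matrix A standing in for the encoder output; the column names are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical original frame with a non-default index.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0]}, index=[5, 6, 7])

# Stand-in for the OneHotEncoder output (one row per row of df).
A = csr_matrix([[1, 0], [0, 1], [1, 0]])

# Pass df.index so the sparse columns carry the same row labels as df.
sparse_part = pd.DataFrame.sparse.from_spmatrix(
    A, index=df.index, columns=["cat_a", "cat_b"]
)

# pd.concat now receives two DataFrames, so the TypeError does not occur.
merged = pd.concat([df, sparse_part], axis=1)
print(merged.dtypes)
```

The encoded columns keep their sparse dtype after the concat, so memory use stays low as long as later operations do not convert them to dense.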
Alternatively, you can pass sparse matrices to sklearn to avoid running out of memory when converting back to pandas. Just convert your other data to sparse format by passing a numpy array to the scipy.sparse.csr_matrix constructor and use scipy.sparse.hstack to combine (see docs).
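A minimal sketch of that approach, assuming the other features are numeric and already available as a numpy array. The arrays other and encoded are hypothetical stand-ins for the remaining columns and the encoder output.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical dense numeric features (2 rows, 1 column).
other = np.array([[10.0], [20.0]])

# Stand-in for the OneHotEncoder output.
encoded = csr_matrix([[0, 1], [1, 0]])

# Convert the dense block to sparse, then stack the two column-wise.
# format="csr" requests CSR output explicitly.
X = hstack([csr_matrix(other), encoded], format="csr")
print(X.shape)
```

The combined matrix X can be passed to scikit-learn estimators that accept sparse input, which skips the round trip through pandas.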
Per the Pandas Sparse data structures documentation, SparseDataFrame and SparseSeries have been removed.
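Previous way (no longer works in pandas 1.0 and later):
pd.SparseDataFrame({"A": [0, 1]})
New way:
pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Constructing from a scipy csr_matrix, previous way (no longer works in pandas 1.0 and later):
from scipy.sparse import csr_matrix
matrix = csr_matrix((3, 4), dtype=np.int8)
df = pd.SparseDataFrame(matrix, columns=['A', 'B', 'C', 'D'])
New way: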
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
matrix = csr_matrix((3, 4), dtype=np.int8)
df = pd.DataFrame.sparse.from_spmatrix(matrix, columns=['A', 'B', 'C', 'D'])
df.dtypes
Output:
A    Sparse[int8, 0]
B    Sparse[int8, 0]
C    Sparse[int8, 0]
D    Sparse[int8, 0]
dtype: object
df.sparse.to_dense()
Output:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
df.sparse.density
Output:
0.0
You could also avoid getting back a sparse matrix in the first place by setting the parameter sparse to False when creating the encoder (in scikit-learn 1.2 and later this parameter is named sparse_output).
The documentation of the OneHotEncoder states:
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
Then you can again call the DataFrame constructor to transform the numpy array to a DataFrame.
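As a rough sketch of the dense-output route: the frame df and its column names below are invented for illustration, and the try/except is only one possible way to cope with the parameter being called sparse_output in scikit-learn 1.2 and later and sparse in older releases.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical input frame; "color" is the categorical column to encode.
df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0], "color": ["red", "blue", "red"]}
)

try:
    encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2
except TypeError:
    encoder = OneHotEncoder(sparse=False)  # older releases

# With dense output, fit_transform returns a plain numpy array.
dense = encoder.fit_transform(df[["color"]])

# Build readable column names from the categories learned during fit.
columns = ["color_" + str(c) for c in encoder.categories_[0]]
encoded = pd.DataFrame(dense, index=df.index, columns=columns)

merged = pd.concat([df.drop(columns="color"), encoded], axis=1)
print(merged)
```

This is convenient for small data, but a dense array can use a lot of memory when a column has many distinct categories.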