Have data with more then 1 million rows and 30 columns, one of the columns is user_id (more then 1500 different users). I want one-hot-encode this column and to use data in ML algorithms (xgboost, FFM, scikit). But due to huge row numbers and unique user values matrix will be ~ 1 million X 1500, so need do this in sparse format (otherwise data kill all RAM).
For me convenient way to work with data through pandas DataFrame, which also now it support sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
Work pretty fast and have small size in RAM. But for working with scikit algos and xgboost it's necessary transform dataframe to sparse matrix.
Is there any way to do this rather than iterate through columns and hstack them in one scipy sparse matrix? I tried df.as_matrix() and df.values, but all of first transform data to dense what arise MemoryError :(
P.S. Same to get DMatrix for xgboost
UPDATE:
So i release next solution (will be thankful for optimisation suggestions):
def sparse_df_to_saprse_matrix (sparse_df):
index_list = sparse_df.index.values.tolist()
matrix_columns = []
sparse_matrix = None
for column in sparse_df.columns:
sps_series = sparse_df[column]
sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
curr_sps_column, rows, cols = sps_series.to_coo()
if sparse_matrix != None:
sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
else:
sparse_matrix = curr_sps_column
matrix_columns.extend(cols)
return sparse_matrix, index_list, matrix_columns
And the following code allows to get sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created sparse matrix 1,1 million rows x 1150 columns. But during creating it's still uses significant amount of RAM (~10Gb on edge with my 12Gb).
Don't know why, because resulting sparse matrix uses only 300 Mb (after loading from HDD). Any ideas?
You should be able to use the experimental .to_coo()
method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
This method, instead of taking a DataFrame
(rows / columns) it takes a Series
with rows and columns in a MultiIndex
(this is why you need the .stack()
method). This Series
with the MultiIndex
needs to be a SparseSeries
, and even if your input is a SparseDataFrame
, .stack()
returns a regular Series
. So, you need to use the .to_sparse()
method before calling .to_coo()
.
The Series
returned by .stack()
, even if it's not a SparseSeries
only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan
when the type is np.float
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With