
Pandas: ValueError (any way to convert Sparse[float64, 0.0] dtypes to float64?)

I have a dataframe X_train to which I am concatenating two other dataframes. The second and third dataframes are built from sparse matrices generated by a TF-IDF vectorizer:

q1_train_df = pd.DataFrame.sparse.from_spmatrix(q1_tdidf_train,index=X_train.index,columns=q1_features)
q2_train_df = pd.DataFrame.sparse.from_spmatrix(q2_tdidf_train,index=X_train.index,columns=q2_features)
X_train_final  = pd.concat([X_train,q1_train_df,q2_train_df],axis=1)

The dtypes of X_train_final look like this:


X_train_final.dtypes

cwc_min                       float64
cwc_max                       float64
csc_min                       float64
csc_max                       float64
ctc_min                       float64
                         ...         
q2_zealand       Sparse[float64, 0.0]
q2_zero          Sparse[float64, 0.0]
q2_zinc          Sparse[float64, 0.0]
q2_zone          Sparse[float64, 0.0]
q2_zuckerberg    Sparse[float64, 0.0]
Length: 10015, dtype: object

I am using XGBoost to train on this final dataframe, and it throws an error when I try to fit the data:

model.fit( X_train_final,y_train)


ValueError: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields q1_04, q1_10, q1_100, q

I think the error is caused by the Sparse[float64, 0.0] dtypes in the dataframe. Can you please help? I can't figure out how to get past this error.

Atish asked Nov 24 '25 04:11


1 Answer

I just came across the exact same issue. I have a list of columns that were generated with a TF-IDF vectorizer, and I was attempting to use XGBoost on the dataset.

This ended up working for me:

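import pandas as pd
import xgboost as xgb

# Coerce every column to a numeric dtype; values that cannot be parsed become NaN.
train_df = train_df.apply(pd.to_numeric, errors='coerce')

# Convert the sparse TF-IDF columns to ordinary dense columns.
train_df[tf_idf_column_names] = train_df[tf_idf_column_names].sparse.to_dense()

# The first column is the label; the remaining columns are the features.
train_x = train_df.iloc[:, 1:]
train_y = train_df.iloc[:, :1]

dtrain = xgb.DMatrix(data=train_x, label=train_y)

# 'verbosity': 0 replaces the deprecated 'silent': 1.
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)

# dtest is assumed to be an xgb.DMatrix built from the test data in the same way.
preds = bst.predict(dtest)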
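
If you don't already have a list like tf_idf_column_names, the sparse columns can be picked out by dtype instead. Here is a minimal sketch of that idea on a tiny made-up frame; the column names and values are hypothetical stand-ins for X_train and the TF-IDF frames, not the data from the question. Keep in mind that densifying around 10,000 columns may need a lot of memory.

```python
import pandas as pd

# Hypothetical stand-in for X_train: ordinary dense float64 features.
dense_part = pd.DataFrame({"cwc_min": [0.5, 0.0, 1.0], "cwc_max": [0.2, 0.9, 0.4]})

# Hypothetical stand-in for the TF-IDF frames: Sparse[float64, 0.0] columns.
sparse_part = pd.DataFrame({
    "q1_zero": pd.arrays.SparseArray([0.0, 0.3, 0.0], fill_value=0.0),
    "q2_zone": pd.arrays.SparseArray([0.7, 0.0, 0.0], fill_value=0.0),
})

combined = pd.concat([dense_part, sparse_part], axis=1)

# Select only the columns whose dtype is a pandas sparse dtype.
sparse_cols = [
    col for col in combined.columns
    if isinstance(combined[col].dtype, pd.SparseDtype)
]

# Densify those columns one at a time; the dense columns are left untouched.
dense_df = combined.copy()
for col in sparse_cols:
    dense_df[col] = dense_df[col].sparse.to_dense()

print(dense_df.dtypes)
```

After this, every column in dense_df should have a plain float64 dtype, which is the kind of input the error message asks for.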
sunshinedrinker answered Nov 25 '25 19:11


