When converting a Pandas dataframe to a H2O frame using the h2o.H2OFrame() function an error is occurring.
Additional rows are being created in the H2o Frame. When I looked into this, it appears the new rows are duplicates of other rows. Depending on the data size the number of duplicate rows added varies, but typically around 2-10.
Code:
train_h2o = h2o.H2OFrame(python_obj=train_df_complete)
print(train_df_complete.shape[0])
print(train_h2o.nrow)
Output:
3871998
3872000
As you can see here, 2 additional rows have being added. When studied closer there are now 2 rows per user for 2 of the users. I.e. 2 rows have being duplicated.
This appears to be a major bug, does anyone have experience of this problem and is there a way to fix it?
Thanks
I had the same issue, assume your "train_h2o" does not have duplicates, just identify the index of the duplicates in dataframe and remove it. Unfortunately, the h2o Dataframe has limited functionality.
temp_df = train_h2o.as_data_frame()
train_h2o = train_h2o.drop(list(temp_df[temp_df.duplicated()].index), axis=0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With