
H2OFrame() in Python is adding duplicate rows when converting a Pandas DataFrame - bug?

When I convert a Pandas DataFrame to an H2OFrame using the h2o.H2OFrame() function, the row count changes.

Additional rows are created in the H2OFrame. When I looked into this, the new rows appear to be duplicates of existing rows. The number of duplicate rows added varies with the data size, but is typically around 2-10.

Code:

train_h2o = h2o.H2OFrame(python_obj=train_df_complete)

print(train_df_complete.shape[0])
print(train_h2o.nrow)

Output:

3871998
3872000

As you can see, 2 additional rows have been added. On closer inspection, 2 of the users now have 2 rows each, i.e. 2 rows have been duplicated.

This appears to be a major bug. Has anyone experienced this problem, and is there a way to fix it?

Thanks

George asked Aug 14 '17 10:08


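1 Answer

I had the same issue. Assuming your original "train_df_complete" does not contain genuine duplicates, just identify the indices of the duplicated rows and remove them from "train_h2o". Unfortunately, the H2OFrame has limited functionality, so the duplicates are found by converting back to a Pandas DataFrame.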

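# Pull the H2OFrame back into pandas so duplicated() can be used
temp_df = train_h2o.as_data_frame()
# Drop the flagged row positions from the H2OFrame (axis=0 means rows)
train_h2o = train_h2o.drop(list(temp_df[temp_df.duplicated()].index), axis=0)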
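
To see which row positions the snippet above would pass to drop(), here is a minimal pandas-only sketch on toy data. The DataFrame is a hypothetical stand-in for the result of train_h2o.as_data_frame(), and the column names user_id and score are made up for illustration. It assumes the converted frame has a default 0..n-1 index, so that the pandas index labels line up with H2OFrame row positions.

```python
import pandas as pd

# Toy stand-in for train_h2o.as_data_frame(); columns are hypothetical.
temp_df = pd.DataFrame(
    {"user_id": [1, 2, 2, 3, 3], "score": [0.5, 0.7, 0.7, 0.9, 0.9]}
)

# duplicated() defaults to keep="first": only the second and later
# copies of a row are flagged, so the first copy survives the drop.
dup_index = list(temp_df[temp_df.duplicated()].index)
print(dup_index)

# Dropping the flagged positions leaves exactly one copy of each row.
deduped = temp_df.drop(index=dup_index)
print(deduped.shape[0])
```

Note that this removes every repeated row, so it is only safe if the original data has no legitimate duplicates. Comparing the final row count against train_df_complete.shape[0] is a quick sanity check.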
Alex G answered Oct 19 '22 16:10