H2OFrame() in Python is adding duplicate rows when converting a Pandas DataFrame - Bug?

Question

When I convert a Pandas DataFrame to an H2OFrame using the h2o.H2OFrame() function, the result does not match the input.

Additional rows are created in the H2OFrame. When I looked into this, the new rows turned out to be duplicates of existing rows. The number of duplicate rows added varies with the size of the data, but it is typically around 2-10.

Code:

import h2o

h2o.init()
train_h2o = h2o.H2OFrame(python_obj=train_df_complete)

print(train_df_complete.shape[0])  # rows in the Pandas DataFrame
print(train_h2o.nrow)              # rows in the H2OFrame

Output:

3871998
3872000

As you can see, 2 additional rows have been added. On closer inspection, 2 of the users now have 2 rows each instead of 1, i.e. 2 rows have been duplicated.

This appears to be a major bug. Does anyone have experience with this problem, and is there a way to fix it?

Thanks

Alex G · Accepted Answer

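I had the same issue. Assuming your original data does not contain genuine duplicates, just identify the indices of the duplicated rows in a Pandas DataFrame and remove those rows from "train_h2o". Unfortunately, the H2OFrame has limited functionality, so the duplicates have to be located on the Pandas side.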

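# Pull the H2OFrame back into Pandas to locate the duplicated rows
temp_df = train_h2o.as_data_frame()
# Drop those row indices from the H2OFrame (axis=0 selects rows)
train_h2o = train_h2o.drop(list(temp_df[temp_df.duplicated()].index), axis=0)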
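
To see which indices that expression selects, here is a minimal Pandas-only sketch. The toy frame and its column names (user_id, score) are made up for illustration and stand in for the result of train_h2o.as_data_frame(); no H2O cluster is needed to run it.

```python
import pandas as pd

# Hypothetical stand-in for train_h2o.as_data_frame():
# rows 2 and 4 are exact copies of rows 0 and 1.
temp_df = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "score": [0.5, 0.7, 0.5, 0.9, 0.7],
})

# duplicated() marks every repeat after the first occurrence
# (keep="first" is the default), so the originals are kept.
dup_mask = temp_df.duplicated()
dup_index = list(temp_df[dup_mask].index)

# Dropping those indices leaves exactly one copy of each row.
deduped = temp_df.drop(index=dup_index)

print(dup_index)
print(deduped.shape[0])
```

Note that this only works if the source data has no legitimate repeated rows, since duplicated() cannot tell an extra copy from a real one. If only one row per user is expected, passing a column list such as duplicated(subset=["user_id"]) may be a safer check.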

Tags: python, python-3.x, pandas, h2o

George
