
Efficient conversion of a pandas DataFrame to an H2O frame

I have a pandas DataFrame built from data that is latin-1 encoded and delimited by ;. The DataFrame is very large, roughly 350000 x 3800. I initially wanted to use sklearn, but my DataFrame has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore had to use H2O's Distributed Random Forest to train on the dataset. The main problem is that the DataFrame is not converted efficiently when I call h2o.H2OFrame(data). I looked for a way to specify the encoding, but found nothing in the documentation.

Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know the columns could be imputed, but I should not do that with my dataset: the columns hold values from different sensors, and a missing value means that the sensor is not present. I can only use Python.

ayaan asked Oct 27 '17 09:10



1 Answer

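import h2o
import pandas as pd

# Start (or connect to) a local H2O cluster before creating any frames.
h2o.init()

# A small DataFrame with non-ASCII text converts directly to an H2OFrame.
df = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)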
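For a frame as large as the one in the question, a commonly suggested alternative is to write the DataFrame to a UTF-8 CSV file and let H2O parse that file itself. The following is only a minimal sketch under stated assumptions: a local H2O cluster can be started, the helper names dataframe_to_utf8_csv and load_csv_into_h2o are hypothetical, and no speed-up has been measured here.

```python
import pandas as pd


def dataframe_to_utf8_csv(df, path):
    """Write the DataFrame to `path` as UTF-8 CSV without the index column."""
    df.to_csv(path, index=False, encoding="utf-8")
    return path


def load_csv_into_h2o(path):
    """Let the H2O cluster parse the CSV file directly (hypothetical helper)."""
    import h2o  # imported lazily so the CSV step works without H2O installed

    h2o.init()  # assumption: a local cluster can be started or reached
    return h2o.import_file(path)


# Example usage (assumed file name):
#   csv_path = dataframe_to_utf8_csv(df, "data_utf8.csv")
#   hf = load_csv_into_h2o(csv_path)
```

Re-encoding to UTF-8 on the pandas side sidesteps the question of passing an encoding option to H2O, since the original latin-1 decoding has already happened when the DataFrame was created.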

Since the problem you are facing is due to the high number of NaNs in the dataset, handle those first. There are two ways to do so:

  1. Replace NaN with a single, obviously out-of-range value. For example, if a feature varies between 0 and 1, replace every NaN in that feature with -1.

  2. Use scikit-learn's Imputer class (SimpleImputer in recent versions) to handle NaN values. It replaces each NaN with the mean, median, or mode of that feature.
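Both options can be sketched with pandas alone. This is an illustrative example, not the answerer's code: the sensor column names and values are made up, the sentinel -1 assumes every feature lies in the 0 to 1 range, and the second option uses pandas' fillna with the column median as a stand-in for an imputer class.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; NaN means the sensor is not present.
readings = pd.DataFrame({
    "sensor_a": [0.2, np.nan, 0.9],
    "sensor_b": [np.nan, 0.5, 0.1],
})

# Assumption: valid readings lie in [0, 1], so -1 is clearly out of range.
SENTINEL = -1.0

# Option 1: mark every missing reading with the sentinel value.
with_sentinel = readings.fillna(SENTINEL)

# Option 2 (stand-in for an imputer class): fill each column with its median.
with_median = readings.fillna(readings.median())
```

The sentinel approach keeps "sensor not present" distinguishable from real readings, which matches the concern raised in the question; median filling does not.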

Anand C U answered Oct 06 '22 23:10