Converting a pandas DataFrame to an H2O frame efficiently

Tags:

python

machine-learning

h2o

I have a pandas DataFrame whose data is latin-1 encoded and delimited by ;. It is very large, roughly 350000 x 3800. I initially wanted to use sklearn, but my DataFrame has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore had to use H2O's Distributed Random Forest to train on the dataset. The main problem is that the DataFrame is not converted efficiently when I call h2o.H2OFrame(data). I looked for a way to specify the encoding, but found nothing in the documentation.

Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know the columns could be imputed, but I should not do that for my dataset: the columns are values from different sensors, and a missing value means the sensor is not present. I can only use Python.


asked Oct 27 '17 09:10

ayaan

1 Answer

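import h2o
import pandas as pd

# Start (or connect to) a local H2O cluster before creating any frames.
h2o.init()

# Small frame with non-ASCII text to check that the conversion preserves it.
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)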
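For a frame as large as the one in the question, a commonly suggested alternative to `h2o.H2OFrame(df)` is to write the data to disk once and let H2O parse the file itself. The sketch below is not part of the original answer; it assumes a local H2O cluster can be started, and the helper names `dataframe_to_csv` and `load_into_h2o` as well as the file name `sensors.csv` are made up for illustration. Writing the file as UTF-8 sidesteps the missing encoding option, and pandas writes NaN as empty fields, which H2O should then read as missing values.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def dataframe_to_csv(df, path):
    """Write df as a UTF-8, ';'-delimited CSV (hypothetical helper)."""
    # NaN values are written as empty fields by default.
    df.to_csv(path, sep=";", index=False, encoding="utf-8")
    return path


def load_into_h2o(path):
    """Parse the CSV inside H2O instead of pushing a pandas object (hypothetical helper)."""
    import h2o  # imported here so the CSV step works without H2O installed

    h2o.init()
    # header=1 tells H2O that the first line holds the column names.
    return h2o.import_file(path, sep=";", header=1)


# Tiny stand-in for the real 350000 x 3800 sensor frame.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 2.0],
    "col2": ["César Chávez Day"] * 3,
})

csv_path = dataframe_to_csv(df, os.path.join(tempfile.mkdtemp(), "sensors.csv"))

# Uncomment when an H2O cluster is available:
# hf = load_into_h2o(csv_path)
```

Whether this is faster in practice depends on disk speed and cluster setup, so it is worth timing both routes on a subset of the data first.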

The problem you are facing comes from the large number of NaN values in the dataset, so handle those first. There are two ways to do so:

1. Replace NaN with a single, obviously out-of-range value. For example, if a feature varies between 0 and 1, replace all NaN values in that feature with -1.
2. Use scikit-learn's Imputer class (replaced by SimpleImputer in sklearn.impute in current releases) to handle NaN values. This replaces each NaN with the mean, median, or mode of that feature.
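Both options can be sketched on a toy frame as follows. The column names and values are invented for illustration, and `SimpleImputer` from `sklearn.impute` is used here as the current counterpart of the `Imputer` class named above; treat this as a minimal sketch rather than a recommendation for the sensor data in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sensor readings in the range 0-1; NaN marks a missing reading.
df = pd.DataFrame({
    "sensor_a": [0.2, np.nan, 0.9, 0.4],
    "sensor_b": [np.nan, 0.5, 0.7, np.nan],
})

# Option 1: replace every NaN with an out-of-range sentinel value.
sentinel_filled = df.fillna(-1)

# Option 2: replace every NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
```

The sentinel approach keeps "sensor absent" distinguishable from real readings, which tree-based models can split on, whereas median imputation hides that information.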


answered Oct 06 '22 23:10

Anand C U
