Does sklearn use pandas index as a feature?

Tags: pandas, scikit-learn

I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]  # labels
X = df_features.drop(columns=["target"])  # feature columns only

estimator = RandomForestClassifier()
estimator.fit(X, y)


asked Oct 31 '19 00:10

steve

1 Answer

No, sklearn doesn't use the index as one of your features. When you call the fit method, your input is passed through the check_array function. If you dig into check_array, you'll find that it converts the input to a NumPy array (the same effect as calling np.array on the DataFrame), and that conversion keeps only the column values and strips the index, as shown below:

import pandas as pd
import numpy as np

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df

   Name  Age
0   tom   10
1  nick   15
2  juli   14

# Converting to an array keeps only the values; the index (0, 1, 2) is gone
np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)
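
If you want to confirm this on an actual estimator, here is a minimal sketch. The data is synthetic and made up for illustration, and it assumes a scikit-learn version that exposes the n_features_in_ attribute (0.24 or later, as far as I know). It fits the same forest on identical rows with two different indexes and compares the results:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (hypothetical values, only for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame({"feat1": rng.normal(size=50), "feat2": rng.normal(size=50)})
y = pd.Series((X["feat1"] + X["feat2"] > 0).astype(int), name="target")

# Same rows in the same order, but with a completely different index
X_reindexed = X.copy()
X_reindexed.index = np.arange(1000, 1050)
y_reindexed = y.copy()
y_reindexed.index = X_reindexed.index

# Identical hyperparameters and seed, so any difference would come from the index
model_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
model_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(
    X_reindexed, y_reindexed
)

# Number of features the estimator saw: only the columns are counted
n_features = model_a.n_features_in_

# Compare predictions from the two fits
same_predictions = bool(
    np.array_equal(model_a.predict(X), model_b.predict(X_reindexed))
)

print(n_features, same_predictions)
```

One consequence worth keeping in mind: because the index is dropped, X and y are matched by row position, not by index label, so make sure both are in the same order before calling fit.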

Hope this helps!


answered Sep 21 '22 19:09

Parthasarathy Subburaj
