Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does sklearn use pandas index as a feature?

I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?

df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])

estimator = RandomForestClassifier()
estimator.fit(X, y)
like image 735
steve Avatar asked Oct 31 '19 00:10

steve


People also ask

What is the use of pandas index?

Index is like an address, that's how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names. Pandas have three data structures dataframe, series & panel.

What does scikit-learn feature?

Through scikit-learn, we can implement various machine learning models for regression, classification, clustering, and statistical tools for analyzing these models. It also provides functionality for dimensionality reduction, feature selection, feature extraction, ensemble techniques, and inbuilt datasets.

Does Panda series have index?

Pandas series is a One-dimensional ndarray with axis labels. The labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Pandas Series.

What is pandas in Sklearn?

This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames. In particular, it provides a way to map DataFrame columns to transformations, which are later recombined into features.


1 Answers

No, sklearn doesn't use the index as one of your feature. It essentially happens here, when you call the fit method the check_array function will be applied. And now if you dig deep into the check_array function, you can find that you are converting your input into array using np.array function which essentially strips the indices from your dataframe as shown below:

import pandas as pd 
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 
df  

    Name    Age
0   tom     10
1   nick    15
2   juli    14

np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)

Hope this helps!

like image 111
Parthasarathy Subburaj Avatar answered Sep 21 '22 19:09

Parthasarathy Subburaj