I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])
estimator = RandomForestClassifier()
estimator.fit(X, y)
An index is like an address: it is how any data point in a DataFrame or Series is accessed. Both rows and columns have indexes; the row labels are called the index, and the column labels are the column names. pandas has two main data structures, Series and DataFrame (the older Panel structure has been removed from current pandas versions).
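To make the "address" idea concrete, here is a minimal sketch with a made-up three-row DataFrame (the values and the labels "a", "b", "c" are invented for illustration). It shows that the row index is a set of labels stored alongside the data, not one of the columns:

```python
import pandas as pd

# Hypothetical toy data; the index labels "a", "b", "c" are arbitrary.
df = pd.DataFrame(
    {"feat1": [1.0, 2.0, 3.0], "feat2": [10, 20, 30]},
    index=["a", "b", "c"],
)

# Access the same row by its label and by its position.
row_by_label = df.loc["b"]
row_by_position = df.iloc[1]

print(df.index.tolist())    # row labels (the index)
print(df.columns.tolist())  # column labels (the column names)
print(df.shape)             # the index is not counted as a column
```

Because the index lives outside the columns, the DataFrame above has two columns, not three.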
scikit-learn provides machine learning models for regression, classification, and clustering, along with statistical tools for analyzing these models. It also provides functionality for dimensionality reduction, feature selection, feature extraction, ensemble techniques, and built-in datasets.
A pandas Series is a one-dimensional ndarray with axis labels. The labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
The sklearn-pandas module provides a bridge between scikit-learn's machine learning methods and pandas-style DataFrames. In particular, it provides a way to map DataFrame columns to transformations, which are later recombined into features.
No, sklearn doesn't use the index as one of your features. When you call the fit method, the check_array function is applied to the input. If you dig into check_array, you will find that it converts the input to a NumPy array (with np.array or np.asarray, depending on the scikit-learn version), which strips the index from your dataframe, as shown below:
import pandas as pd
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df
   Name  Age
0   tom   10
1  nick   15
2  juli   14
np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)
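If you want to check this against an actual estimator, the following is a rough sketch rather than a definitive test. It assumes a reasonably recent scikit-learn (the n_features_in_ attribute is not available in old releases) and uses invented toy data. Two copies of the same data are fitted, one with the default index and one with arbitrary index labels, using the same random_state:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data, invented for illustration only.
X_default = pd.DataFrame(
    {"feat1": [0.1, 0.9, 0.2, 0.8, 0.15, 0.85],
     "feat2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]}
)
y_default = pd.Series([0, 1, 0, 1, 0, 1], name="target")

# Same values in the same row order, but with arbitrary index labels.
new_index = [1000, 7, 42, 3, 99, 500]
X_relabelled = X_default.copy()
X_relabelled.index = new_index
y_relabelled = y_default.copy()
y_relabelled.index = new_index

# The arrays handed to the estimator are identical: index labels are dropped.
same_array = np.array_equal(np.asarray(X_default), np.asarray(X_relabelled))

# Same seed, so any difference would have to come from the index.
clf_a = RandomForestClassifier(n_estimators=10, random_state=0)
clf_a.fit(X_default, y_default)
clf_b = RandomForestClassifier(n_estimators=10, random_state=0)
clf_b.fit(X_relabelled, y_relabelled)

same_proba = np.allclose(
    clf_a.predict_proba(X_default), clf_b.predict_proba(X_default)
)

print(same_array)           # whether both inputs convert to the same array
print(clf_a.n_features_in_) # number of features seen during fit
print(same_proba)           # whether both fitted models agree
```

Both models should report two features and agree on their predictions, since the index never reaches the estimator. One consequence worth keeping in mind: scikit-learn matches X and y by row position, not by index label, so the rows of X and y must already be in the same order.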
Hope this helps!