I have a pandas data frame which has some rows and columns. Each column has a header. Now as long as I keep doing data manipulation operations in pandas, my variable headers are retained. But if I try some data pre-processing feature of Sci-kit-learn lib, I end up losing all my headers and the frame gets converted to just a matrix of numbers.
I understand why it happens because scikit-learn gives a numpy ndarray as output. And numpy ndarray being just matrix would not have column names.
But here is the thing. If I am building some model on my dataset, even after initial data pre-processing and trying some model, I might have to do some more data manipulation tasks to run some other model for better fit. Without being able to access column header makes it difficult to do data manipulation as I might not know what is the index of a particular variable, but it's easier to remember variable name or even look up by doing df.columns.
How to overcome that?
EDIT1: Editing with sample data snapshot.
Pclass Sex Age SibSp Parch Fare Embarked 0 3 0 22 1 0 7.2500 1 1 1 1 38 1 0 71.2833 2 2 3 1 26 0 0 7.9250 1 3 1 1 35 1 0 53.1000 1 4 3 0 35 0 0 8.0500 1 5 3 0 NaN 0 0 8.4583 3 6 1 0 54 0 0 51.8625 1 7 3 0 2 3 1 21.0750 1 8 3 1 27 0 2 11.1333 1 9 2 1 14 1 0 30.0708 2 10 3 1 4 1 1 16.7000 1 11 1 1 58 0 0 26.5500 1 12 3 0 20 0 0 8.0500 1 13 3 0 39 1 5 31.2750 1 14 3 1 14 0 0 7.8542 1 15 2 1 55 0 0 16.0000 1
The above is basically the pandas data frame. Now when I do this on this data frame it will strip the column headers.
from sklearn import preprocessing X_imputed=preprocessing.Imputer().fit_transform(X_train) X_imputed
New data is of numpy array and hence the column names are stripped.
array([[ 3. , 0. , 22. , ..., 0. , 7.25 , 1. ], [ 1. , 1. , 38. , ..., 0. , 71.2833 , 2. ], [ 3. , 1. , 26. , ..., 0. , 7.925 , 1. ], ..., [ 3. , 1. , 29.69911765, ..., 2. , 23.45 , 1. ], [ 1. , 0. , 26. , ..., 0. , 30. , 2. ], [ 3. , 0. , 32. , ..., 0. , 7.75 , 3. ]])
So I want to retain the column names when I do some data manipulation on my pandas data frame.
To get a list of columns from the DataFrame header use DataFrame. columns. values. tolist() method.
A header necessarily stores the names or headings for each of the columns. It basically helps the user to identify the role of the respective column in the data frame. The top row containing column names is called the header row of the data frame.
Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store data off-heap (outside the main Java Heap space, but still inside RAM), which in turn reduces the garbage collection overload.
To get the nth row in a Pandas DataFrame, we can use the iloc() method. For example, df. iloc[4] will return the 5th row because row numbers start from 0.
scikit-learn indeed strips the column headers in most cases, so just add them back on afterward. In your example, with X_imputed
as the sklearn.preprocessing
output and X_train
as the original dataframe, you can put the column headers back on with:
X_imputed_df = pd.DataFrame(X_imputed, columns = X_train.columns)
The above answers still do not resolve the main question. There are two implicit assumptions here
There is a "get_support()" method in at least some of the fit and transform functions that save the information on which columns(features) are retained and in what order.
You can check the basics of the function and how to use it here ... Find get_support() function description here
This would be the most preferred and official way to get the information needed here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With