How to retain column headers of data frame after Pre-processing in scikit-learn

Tags: python, pandas, numpy, scikit-learn

I have a pandas data frame with some rows and columns, and each column has a header. As long as I keep doing data manipulation in pandas, my column headers are retained. But if I use one of the data pre-processing features of scikit-learn, I lose all my headers and the frame gets converted to just a matrix of numbers.

I understand why this happens: scikit-learn returns a numpy ndarray as output, and an ndarray, being just a matrix, does not have column names.

But here is the thing. When I am building a model on my dataset, even after the initial pre-processing and a first model, I might have to do more data manipulation to fit another model that works better. Not being able to access the column headers makes that difficult, because I might not know the index of a particular variable, whereas it's easy to remember a variable's name or look it up with df.columns.

How can I overcome that?

EDIT1: Editing with sample data snapshot.

            Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
     0       3    0   22      1      0   7.2500         1
     1       1    1   38      1      0  71.2833         2
     2       3    1   26      0      0   7.9250         1
     3       1    1   35      1      0  53.1000         1
     4       3    0   35      0      0   8.0500         1
     5       3    0  NaN      0      0   8.4583         3
     6       1    0   54      0      0  51.8625         1
     7       3    0    2      3      1  21.0750         1
     8       3    1   27      0      2  11.1333         1
     9       2    1   14      1      0  30.0708         2
    10       3    1    4      1      1  16.7000         1
    11       1    1   58      0      0  26.5500         1
    12       3    0   20      0      0   8.0500         1
    13       3    0   39      1      5  31.2750         1
    14       3    1   14      0      0   7.8542         1
    15       2    1   55      0      0  16.0000         1

The above is basically my pandas data frame. When I run the following on it, the column headers are stripped:

    from sklearn import preprocessing
    X_imputed = preprocessing.Imputer().fit_transform(X_train)
    X_imputed

The new data is a numpy array, and hence the column names are stripped.

    array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
              7.25      ,   1.        ],
           [  1.        ,   1.        ,  38.        , ...,   0.        ,
             71.2833    ,   2.        ],
           [  3.        ,   1.        ,  26.        , ...,   0.        ,
              7.925     ,   1.        ],
           ...,
           [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
             23.45      ,   1.        ],
           [  1.        ,   0.        ,  26.        , ...,   0.        ,
             30.        ,   2.        ],
           [  3.        ,   0.        ,  32.        , ...,   0.        ,
              7.75      ,   3.        ]])

So I want to retain the column names when I do some data manipulation on my pandas data frame.


asked Apr 12 '15 05:04

Baktaawar

2 Answers

scikit-learn indeed strips the column headers in most cases, so just add them back on afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original dataframe, you can put the column headers back on with:

    import pandas as pd
    X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns)
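For a version that can be run end to end, here is a minimal sketch of the same idea. Two things in it are assumptions rather than part of the original answer: the small `X_train` frame is made up to resemble the sample data, and `SimpleImputer` from `sklearn.impute` stands in for `preprocessing.Imputer`, which later scikit-learn releases no longer provide. It also passes `index=` so the row labels survive along with the column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train, shaped like a few rows of the sample data.
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Age": [22.0, 38.0, np.nan, 26.0],
        "Fare": [7.25, 71.2833, 8.4583, 7.925],
    },
    index=[10, 11, 12, 13],  # non-default index, to show it can be kept too
)

# fit_transform returns a plain ndarray here, so the labels are lost.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_train)

# Re-attach the column names and the row index from the original frame.
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_train.columns, index=X_train.index
)

print(X_imputed_df)
```

This only works when the transformer returns exactly the input columns in the input order. With the mean strategy, for example, `SimpleImputer` discards a column that contains nothing but missing values, and then `columns=X_train.columns` would no longer match the array's shape.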


answered Sep 30 '22 11:09

selwyth

The answer above still does not fully resolve the main question. It makes two implicit assumptions:

1. That all the features of the dataset are retained, which might not be true, e.g. after some kind of feature selection.
2. That the features are retained in the same order; again, some feature selection transformations might reorder them implicitly.

At least some transformers (the feature selection ones) have a get_support() method, which reports which columns (features) were retained, either as a boolean mask or as integer indices into the original columns.

The scikit-learn documentation describes get_support() and how to use it.

This is the preferred, supported way to get the information needed here.
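As a rough illustration of how get_support() can be used to recover the headers, here is a minimal sketch. The frame `X` and its column names are made up for the example, and `VarianceThreshold` is just one convenient selector to demonstrate with; the same pattern should apply to other feature selectors that provide get_support().

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: "Constant" has zero variance and will be dropped.
X = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1],
        "Constant": [1, 1, 1, 1],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

selector = VarianceThreshold()  # default threshold of 0.0 removes constant columns
X_selected = selector.fit_transform(X)

# get_support() returns a boolean mask aligned with the input columns;
# indexing the original column Index with it gives the surviving names.
kept_columns = X.columns[selector.get_support()]

X_selected_df = pd.DataFrame(X_selected, columns=kept_columns, index=X.index)

print(list(kept_columns))
```

Because the names come from the selector's own mask, the rebuilt frame stays correct even when the transformer drops columns, which is the case the plain `columns=X.columns` approach cannot handle.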

answered Sep 30 '22 12:09

Vineet Agarwal
