How do I discretize values in a pandas DataFrame and convert to a binary matrix?

I mean something like this:

I have a DataFrame with columns that may be categorical or numeric. For each observation (row), I want to generate a new row in which every possible value of each variable becomes its own binary variable. For example, this matrix (first row is column labels)

'a'     'b'     'c'
one     0.2     0
two     0.4     1
two     0.9     0
three   0.1     2
one     0.0     4
two     0.2     5

would be converted into something like this:

'a'              'b'                                                    'c'
one  two  three  [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]   0   1   2   3   4   5

 1    0     0        0          1          0          0          0       1   0   0   0   0   0
 0    1     0        0          0          1          0          0       0   1   0   0   0   0
 0    1     0        0          0          0          0          1       1   0   0   0   0   0
 0    0     1        1          0          0          0          0       0   0   1   0   0   0
 1    0     0        1          0          0          0          0       0   0   0   0   1   0
 0    1     0        0          1          0          0          0       0   0   0   0   0   1

Each variable (column) in the initial matrix gets expanded into all of its possible values. If it's categorical, then each possible value becomes a new column. If it's a float, then the values are binned in some way (say, always splitting into 10 bins). If it's an int, then each possible int value can become a column, or the values can also be binned.

FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would also be willing to skip this step, as I am really trying to compute a Burt table (which is a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross tabulation is just a simple matrix multiplication.
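The matrix-multiplication route mentioned above can be sketched as follows. This is only an illustration on the categorical and integer columns of the sample data: the names `Z` and `burt` are hypothetical, and the indicator matrix is built with `pd.get_dummies`, which is one possible way to obtain it.

```python
import pandas as pd

# Columns 'a' and 'c' from the sample table in the question.
df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "c": [0, 1, 0, 2, 4, 5],
})

# Indicator (binary) matrix: one 0/1 column per distinct value of each variable.
Z = pd.get_dummies(df, columns=["a", "c"], dtype=int)

# Burt table: the symmetric matrix of all pairwise cross-tabulations.
# The off-diagonal blocks hold the same counts as pd.crosstab(df["a"], df["c"]).
burt = Z.T @ Z
print(burt)
```

The diagonal of `burt` holds the frequency of each level, and the block for a pair of variables holds their cross-tabulation.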


asked May 29 '12 00:05

Uri Laserson

2 Answers

Note that I have implemented new cut and qcut functions for discretizing continuous data:

http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
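As a rough sketch of how these two functions might be applied to the float column `b` from the question: the bin edges and the choice of two quantile bins below are assumptions made for illustration, not something the question specifies.

```python
import pandas as pd

# Column 'b' from the sample table in the question.
b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# Fixed-width bins. right=False makes each interval closed on the left,
# i.e. [0.0, 0.2), [0.2, 0.4), ... Note that a value of exactly 1.0 would
# fall outside the last bin and become NaN with these settings.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
b_fixed = pd.cut(b, bins=edges, right=False)

# Quantile-based bins: here two bins holding roughly equal numbers of values.
b_quantile = pd.qcut(b, q=2)

print(b_fixed)
print(b_quantile)
```

Both calls return a categorical Series, so the result can be passed on to `get_dummies` like any other categorical column.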

answered Sep 30 '22 20:09

Wes McKinney

For labeled columns like the a and c columns in your example, you can use the pandas built-in function get_dummies().

Ex.:

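import numpy as np
import pandas as pd

s1 = ['a', 'b', np.nan]
# dtype=int gives 0/1 output; newer pandas versions return booleans by default
pd.get_dummies(s1, dtype=int)
#    a  b
# 0  1  0
# 1  0  1
# 2  0  0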
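Applied to the whole table from the question, the same idea could look roughly like the sketch below. It assumes the float column is binned first with `pd.cut` using hand-picked edges; the names `edges`, `binned`, and `indicators` are hypothetical.

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Bin the float column so that it becomes categorical (assumed bin edges).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
binned = df.assign(b=pd.cut(df["b"], bins=edges, right=False))

# Expand every column into 0/1 indicator columns. Listing the columns
# explicitly makes get_dummies encode the integer column 'c' as well.
indicators = pd.get_dummies(binned, columns=["a", "b", "c"], dtype=int)
print(indicators)
```

Only values that actually occur in `c` get a column here, so there is no column for 3. For tables with millions of rows, `get_dummies` also accepts `sparse=True`, which may reduce memory use.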

answered Sep 30 '22 20:09

wonderkid2

Tags: python, pandas, dataframe
