
How do I discretize values in a pandas DataFrame and convert to a binary matrix?

I mean something like this:

I have a DataFrame with columns that may be categorical or nominal. For each observation (row), I want to generate a new row where every possible value for the variables is now its own binary variable. For example, this matrix (first row is column labels)

'a'     'b'     'c'
one     0.2     0
two     0.4     1
two     0.9     0
three   0.1     2
one     0.0     4
two     0.2     5

would be converted into something like this:

'a'              'b'                                                    'c'
one  two  three  [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]   0   1   2   3   4   5

 1    0     0        0          1          0          0          0       1   0   0   0   0   0
 0    1     0        0          0          1          0          0       0   1   0   0   0   0
 0    1     0        0          0          0          0          1       1   0   0   0   0   0
 0    0     1        1          0          0          0          0       0   0   1   0   0   0
 1    0     0        1          0          0          0          0       0   0   0   0   1   0
 0    1     0        0          1          0          0          0       0   0   0   0   0   1

Each variable (column) in the initial matrix gets expanded into one binary column per possible value. If it's categorical, each possible value becomes a new column. If it's a float, the values are binned in some way (say, always splitting into 10 bins). If it's an int, each possible int value can become its own column, or the values can be binned as well.
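A minimal sketch of the per-dtype expansion described above, assuming pandas is available. The helper name `to_binary_matrix` and the `n_bins` parameter are hypothetical, chosen only for illustration. Note that `pd.cut` produces right-closed intervals by default, and `get_dummies` only creates columns for integer values that actually occur (so there is no column for c = 3 here).

```python
import pandas as pd
from pandas.api.types import is_float_dtype

def to_binary_matrix(df, n_bins=5):
    """Expand every column of df into 0/1 indicator columns (illustrative helper)."""
    pieces = []
    for name in df.columns:
        col = df[name]
        if is_float_dtype(col):
            # Floats: cut into n_bins equal-width intervals first.
            col = pd.cut(col, bins=n_bins)
        # Categorical, integer, and binned columns: one indicator per value.
        pieces.append(pd.get_dummies(col, prefix=name, dtype="uint8"))
    return pd.concat(pieces, axis=1)

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})
binary = to_binary_matrix(df)
print(binary.shape)
```

Each row of `binary` contains exactly one 1 per original variable, which is the property the example table in the question illustrates.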

FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would also be willing to skip this step, as I am really trying to compute a Burt table (which is a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross tabulation is just a simple matrix multiplication.
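The matrix-multiplication route can be sketched as follows, assuming the indicator matrix Z has already been built (here with `get_dummies` on two columns of the example data). The variable names are illustrative. The check against `pd.crosstab` covers a single pair of variables only.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "c": [0, 1, 0, 2, 4, 5],
})

# Indicator matrix Z: one 0/1 column per observed value of each variable.
Z = pd.get_dummies(df, columns=["a", "c"], dtype=int)

# Burt table: Z'Z. Diagonal blocks hold value counts, off-diagonal
# blocks hold the pairwise cross-tabulations.
burt = Z.T @ Z

# One off-diagonal block should agree with pd.crosstab for that pair.
block = burt.loc[["a_one", "a_three", "a_two"],
                 ["c_0", "c_1", "c_2", "c_4", "c_5"]]
xtab = pd.crosstab(df["a"], df["c"])
print((block.to_numpy() == xtab.to_numpy()).all())
```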

Uri Laserson asked May 29 '12 00:05



2 Answers

Note that I have implemented new cut and qcut functions for discretizing continuous data:

http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
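As a rough sketch of how these two functions might be applied to column b from the question: the edge list and `q=2` below are illustrative choices, not part of the answer. With `right=False` the intervals are left-closed, so a value of exactly 1.0 would fall outside the last bin.

```python
import pandas as pd

b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# cut: fixed edges; right=False makes the intervals left-closed, [lo, hi).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fixed = pd.cut(b, bins=edges, right=False)

# qcut: edges are taken from the data's quantiles, so bins hold roughly
# equal counts (ties in the data can make the counts unequal).
quantile = pd.qcut(b, q=2)

print(fixed.value_counts(sort=False))
print(quantile.value_counts(sort=False))
```

The binned result is categorical, so it can be passed to `pd.get_dummies` to obtain the binary columns.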

Wes McKinney answered Sep 30 '22 20:09


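For labeled columns like the a and c columns in your example, you can use the pandas built-in function get_dummies().

Example:

import numpy as np
import pandas as pd

s1 = ['a', 'b', np.nan]
# NaN is ignored by default, so the last row is all zeros;
# dtype=int gives 0/1 output (recent pandas versions return booleans by default)
pd.get_dummies(s1, dtype=int)
       a  b
    0  1  0
    1  0  1
    2  0  0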
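Applied to the DataFrame from the question, this might look like the sketch below. It assumes the `columns=` argument is used to name the columns to encode, since an integer column such as c is not encoded by default. The float column b is left untouched here.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# columns= names the columns to encode; without it the integer
# column c would be passed through unchanged.
encoded = pd.get_dummies(df, columns=["a", "c"], dtype=int)
print(encoded.columns.tolist())
```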
wonderkid2 answered Sep 30 '22 20:09