I mean something like this:
I have a DataFrame with columns that may be categorical or nominal. For each observation (row), I want to generate a new row where every possible value of each variable becomes its own binary variable. For example, this matrix (first row is column labels)
'a' 'b' 'c'
one 0.2 0
two 0.4 1
two 0.9 0
three 0.1 2
one 0.0 4
two 0.2 5
would be converted into something like this:
'a' 'b' 'c'
one two three [0.0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1.0] 0 1 2 3 4 5
1 0 0 0 1 0 0 0 1 0 0 0 0 0
0 1 0 0 0 1 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 1 1 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 1 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 1 0
0 1 0 0 1 0 0 0 0 0 0 0 0 1
Each variable (column) in the initial matrix gets binned into all of its possible values. If it's categorical, then each possible value becomes a new column. If it's a float, then the values are binned in some way (say, always splitting into 10 bins). If it's an int, then each possible int value can become a column, or the values can be binned as well.
FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.
Is there an easy way to perform this operation?
Separately, I would also be willing to skip this step, as I am really trying to compute a Burt table (which is a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross tabulation is just a simple matrix multiplication.
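To make the matrix-multiplication remark concrete, here is a minimal sketch, assuming the indicator matrix is built with pd.get_dummies on two categorical columns. The toy data and the names Z and burt are illustrative, not part of the original question.

```python
import pandas as pd

# Toy data: columns 'a' and 'c' from the example, both treated as categorical.
df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "c": [0, 1, 0, 2, 4, 5],
})

# Indicator matrix Z: one 0/1 column per (variable, value) pair,
# named like "a_one" or "c_0".
Z = pd.get_dummies(df, columns=["a", "c"], dtype=int)

# Burt table: Z'Z. Diagonal blocks hold per-value counts,
# off-diagonal blocks hold the pairwise cross-tabulations.
burt = Z.T @ Z

# The a-by-c block should match what pd.crosstab reports for the same pair.
a_cols = [col for col in Z.columns if col.startswith("a_")]
c_cols = [col for col in Z.columns if col.startswith("c_")]
block = burt.loc[a_cols, c_cols]
check = pd.crosstab(df["a"], df["c"])
```

For a table with millions of rows, the product only ever has as many rows and columns as there are indicator columns, so the result stays small even when Z itself is large.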
A matrix is a two-dimensional rectangular array that stores data in rows and columns; in Python it is usually represented as a NumPy array. The DataFrame.to_numpy() method converts a DataFrame to a NumPy array.
Pandas DataFrame astype() method: astype() returns a new DataFrame in which the data types have been changed to the specified type.
Convert a column to int (integer): use the pandas DataFrame.astype() function, which you can apply to a specific column or to an entire DataFrame. To cast the data type to a 64-bit signed integer, you can use numpy.int64.
Note that I have implemented new cut and qcut functions for discretizing continuous data:
http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
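As a rough illustration of how these two functions could be applied to the float column b from the question, here is a small sketch. The bin edges are an assumption chosen to mirror the question's five equal-width bins; with right=False the last bin is [0.8, 1.0), so a value of exactly 1.0 would fall outside it.

```python
import pandas as pd

b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# Equal-width, left-closed bins: [0.0, 0.2), [0.2, 0.4), ..., [0.8, 1.0).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
binned = pd.cut(b, bins=edges, right=False)

# Integer position of the bin each value falls into (0 = first bin).
codes = list(binned.cat.codes)

# qcut bins by sample quantiles instead: here, two halves split at the median.
halves = pd.qcut(b, q=2)
```

cut fixes the bin edges up front, while qcut derives them from the data so that each bin holds roughly the same number of observations.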
For labeled columns like the a and c columns in your example, you can use the pandas built-in function get_dummies().
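Example:
import numpy as np
import pandas as pd
s1 = ['a', 'b', np.nan]
# dtype=int keeps the 0/1 output on recent pandas versions, where the default is bool
pd.get_dummies(s1, dtype=int)
   a  b
0  1  0
1  0  1
2  0  0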
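Putting the pieces together for the DataFrame in the question, one possible sketch is below. The helper name expand and the n_bins default are hypothetical; it assumes float columns should be cut into equal-width bins and every other column treated as categorical. Note that pd.cut with an integer bins argument derives the edges from the observed range, so the intervals will not exactly match the ones written in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

def expand(frame, n_bins=5):
    """Return a 0/1 indicator DataFrame (hypothetical helper).

    Float columns are first cut into n_bins equal-width bins;
    all other columns are one-hot encoded on their observed values.
    """
    pieces = []
    for name, col in frame.items():
        if pd.api.types.is_float_dtype(col):
            col = pd.cut(col, bins=n_bins)
        # prefix keeps track of which source column each indicator came from
        pieces.append(pd.get_dummies(col, prefix=name, dtype=int))
    return pd.concat(pieces, axis=1)

Z = expand(df)
```

Only values that actually occur become columns for a and c (so there is no column for c equal to 3), whereas the binned column keeps all of its bins, including empty ones. For very wide or very long tables, get_dummies also accepts sparse=True, which may help with memory.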