
Python Pandas: how to turn a DataFrame with "factors" into a design matrix for linear regression?

If memory serves me, in R there is a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:

1 0 0
0 1 0
or
0 0 1

for the purpose of using lower level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within Pandas, but since I have my own customised regression routines I am really interested in the construction of the design matrix (a 2d numpy array or matrix) from heterogeneous data, with support for mapping back and forth between columns of the numpy object and the Pandas DataFrame from which it is derived.

Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):

>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
       a  b         c
0    one  x  0.000343
1    one  y -0.055651
2    two  y  0.249194
3  three  x -1.486462
4    two  y -0.406930
5    one  x -0.223973
6    six  x -0.189001
>>> 

The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.
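For concreteness, the layout described above can be sketched with pandas alone. This is only an illustration of the target shape, not a full answer: it assumes `pandas.get_dummies` is acceptable for the indicator columns, and the names `a_cols`, `b_cols`, `design_df` and `column_names` are made up for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
    'c': np.random.randn(7),
})

# 'a' has four distinct levels: one indicator column per level.
a_cols = pd.get_dummies(df2['a'], prefix='a', dtype=float)
# 'b' has two levels, so dropping the first leaves a single indicator column.
b_cols = pd.get_dummies(df2['b'], prefix='b', drop_first=True, dtype=float)

# Indicator columns first, the numeric column 'c' unchanged at the end.
design_df = pd.concat([a_cols, b_cols, df2[['c']]], axis=1)

X = design_df.to_numpy()                 # 2-D float array, shape (7, 6)
column_names = list(design_df.columns)   # position i in X <-> column_names[i]
print(column_names)
# ['a_one', 'a_six', 'a_three', 'a_two', 'b_y', 'c']
```

Note that keeping all four 'a' indicators is only safe if the regression has no separate intercept column; otherwise the indicators sum to the intercept and the matrix is rank deficient.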

Thanks,

SetJmp

asked Apr 17 '12 18:04 by Setjmp




1 Answer

There is a new module called patsy that solves this problem. The quickstart linked below solves exactly the problem described above in a couple lines of code.

  • http://patsy.readthedocs.org/en/latest/overview.html

  • http://patsy.readthedocs.org/en/latest/quickstart.html

Here is an example usage:

import pandas
import patsy

dataFrame = pandas.read_csv("salary2.txt")
# salary2.txt is a re-formatted data set from the textbook
# Introductory Econometrics: A Modern Approach
# by Jeffrey Wooldridge
y, X = patsy.dmatrices("sl ~ 1+sx+rk+yr+dg+yd", dataFrame)
# X.design_info provides the metadata behind the X columns
print(X.design_info)

generates:

> DesignInfo(['Intercept',
>             'sx[T.male]',
>             'rk[T.associate]',
>             'rk[T.full]',
>             'dg[T.masters]',
>             'yr',
>             'yd'],
>            term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
>            builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
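The same approach can be applied to the `df2` example from the question. The following is a minimal sketch, assuming a formula without an intercept (`0 + ...`) is what is wanted, so that patsy codes the first categorical term `a` with all four levels and `b` with a single column. The variable names `names`, `a_slice` and `X_array` are only illustrative.

```python
import numpy as np
import pandas as pd
import patsy

df2 = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
    'c': np.random.randn(7),
})

# "0 +" removes the intercept; 'a' then gets one column per level,
# 'b' gets one treatment-coded column, and 'c' is passed through as-is.
X = patsy.dmatrix("0 + a + b + c", df2)

# One name per column of X, in column order.
names = X.design_info.column_names
print(names)

# Map a term of the formula back to the columns it produced.
a_slice = X.design_info.slice("a")   # covers the four columns built from 'a'

# A plain 2-D float array for hand-written regression code.
X_array = np.asarray(X)
print(X_array[:, a_slice].shape)
```

`design_info` stays attached to `X`, so column positions in `X_array` can be traced back to the DataFrame columns that produced them.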
answered Sep 22 '22 17:09 by Setjmp