If memory servies me, in R there is a data type called factor which when used within a DataFrame can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:
1 0 0
0 1 0
or
0 0 1
for the purpose of using lower level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within Pandas, but since I have my own customised regression routines I am really interested in the construction of the design matrix (a 2d numpy array or matrix) from heterogeneous data with support for mapping back and fort between columns of the numpy object and the Pandas DataFrame from which it is derived.
Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):
>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
a b c
0 one x 0.000343
1 one y -0.055651
2 two y 0.249194
3 three x -1.486462
4 two y -0.406930
5 one x -0.223973
6 six x -0.189001
>>>
The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.
Thanks,
SetJmp
We can create a dataframe in R by passing the variable a,b,c,d into the data. frame() function. We can R create dataframe and name the columns with name() and simply specify the name of the variables.
Pandas DataFrame: transpose() functionThe transpose() function is used to transpose index and columns. Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. If True, the underlying data is copied. Otherwise (default), no copy is made if possible.
Data Matrix used in XGBoost. DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.
There is a new module called patsy that solves this problem. The quickstart linked below solves exactly the problem described above in a couple lines of code.
http://patsy.readthedocs.org/en/latest/overview.html
http://patsy.readthedocs.org/en/latest/quickstart.html
Here is an example usage:
import pandas
import patsy
dataFrame = pandas.io.parsers.read_csv("salary2.txt")
#salary2.txt is a re-formatted data set from the textbook
#Introductory Econometrics: A Modern Approach
#by Jeffrey Wooldridge
y,X = patsy.dmatrices("sl ~ 1+sx+rk+yr+dg+yd",dataFrame)
#X.design_info provides the meta data behind the X columns
print X.design_info
generates:
> DesignInfo(['Intercept',
> 'sx[T.male]',
> 'rk[T.associate]',
> 'rk[T.full]',
> 'dg[T.masters]',
> 'yr',
> 'yd'],
> term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
> builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With