
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation

I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np
dfrm = pandas.DataFrame({'A': np.random.rand(100),
                         'B': (50 + np.random.randn(100)),
                         'C': np.random.randint(low=0, high=3, size=(100,))})

And let's say I have a function that computes the quintile of each element in an array. I have my own implementation of this, but as an example just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st
def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # This returns an array the same shape as x, with an integer for which
    # breakpoint-bucket that entry of x falls into.
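For readers who want something concrete to run, here is a minimal sketch of what such a function might look like. It is not the asker's implementation: the name `mark_quintiles_sketch` is hypothetical, and it uses `np.quantile` plus `np.searchsorted` as a stand-in for `st.mstats.mquantiles`, assuming the breakpoints are probabilities in increasing order.

```python
import numpy as np

def mark_quintiles_sketch(x, breakpoints):
    """Return, for each entry of x, the index of the quantile bucket it falls into.

    Hypothetical stand-in for the question's mark_quintiles; np.quantile is
    swapped in for scipy.stats.mstats.mquantiles.
    """
    x = np.asarray(x, dtype=float)
    # Empirical cutoffs of x at each requested probability.
    cutoffs = np.quantile(x, breakpoints)
    # Index of the first cutoff that is >= each value (0 is the lowest bucket).
    return np.searchsorted(cutoffs, x, side="left")

values = np.arange(10, dtype=float)
buckets = mark_quintiles_sketch(values, [0.2, 0.4, 0.6, 0.8, 1.0])
print(buckets)
```

The result has the same shape as the input, which is the property the rest of the question relies on.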

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values,
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0])) 

The problem is that the above code does not add the new column "A_xtile". It just returns my data frame unchanged. If I first add a column called "A_xtile" full of dummy values, like NaN, then it does successfully overwrite this column with the correct quintile markings.

But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.
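One pattern that avoids the placeholder column is to transform only the input column and assign the result back on the original frame. The sketch below assumes that approach is acceptable here; `bucket_by_quantile` is a hypothetical stand-in for `mark_quintiles` (built on `np.quantile`, not on `mquantiles`), and the seeded generator is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

def bucket_by_quantile(values, probs):
    # Hypothetical stand-in for the question's mark_quintiles.
    cutoffs = np.quantile(values, probs)
    return np.searchsorted(cutoffs, values, side="left")

rng = np.random.default_rng(0)
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})
breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# Transform a single column per group; returning a Series that carries the
# group's own index lets pandas align the pieces back to the original rows.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: pd.Series(bucket_by_quantile(s.to_numpy(), breaks), index=s.index)
)
print(dfrm.head())
```

Because the new column is created by the assignment on the left-hand side, the function passed to `transform` never has to add a column itself.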

Note that a simple apply will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

asked Sep 12 '12 13:09 by ely



1 Answer

What problems are you running into with apply? It works for this toy example here and the group lengths are different:

In [82]: df
Out[82]:
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....:

In [84]: df.groupby('X').apply(func)
Out[84]:
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
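The same idea applied to the data from the question might look like the sketch below. The names `bucket_by_quantile` and `add_xtile` are hypothetical, the quantile helper uses `np.quantile` as a stand-in for the asker's own function, and `group_keys=False` is an assumption made to keep the original row index. Depending on the pandas version, `apply` may warn about operating on the grouping column or may leave "C" out of the result.

```python
import numpy as np
import pandas as pd

def bucket_by_quantile(values, probs):
    # Hypothetical stand-in for the question's mark_quintiles.
    cutoffs = np.quantile(values, probs)
    return np.searchsorted(cutoffs, values, side="left")

def add_xtile(group, input_col, new_col, probs):
    # Work on a copy so the frame that pandas passes in is not mutated.
    group = group.copy()
    group[new_col] = bucket_by_quantile(group[input_col].to_numpy(), probs)
    return group

rng = np.random.default_rng(0)
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})

# Extra positional arguments after the function are forwarded to it per group.
result = dfrm.groupby("C", group_keys=False).apply(
    add_xtile, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]
)
print(result.head())
```

The groups here have different sizes, and each returned piece keeps its own row labels, so the pieces can be concatenated without any length mismatch.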
answered Sep 29 '22 09:09 by Chang She