I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.
For example, suppose I create a data frame like this:
    import pandas, numpy as np

    dfrm = pandas.DataFrame({'A': np.random.rand(100),
                             'B': (50 + np.random.randn(100)),
                             'C': np.random.randint(low=0, high=3, size=(100,))})

And let's say I write my own function to compute the quintile of each element in an array. I have my own function for this, but for example just refer to scipy.stats.mstats.mquantiles.
    import scipy.stats as st

    def mark_quintiles(x, breakpoints):
        # Assume this is filled in, using st.mstats.mquantiles.
        # This returns an array the same shape as x, with an integer for which
        # breakpoint-bucket that entry of x falls into.

Now, the real question is how to use transform to add a new column to the data. Something like this:
    def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
        dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values, breaks)
        return dataFrame

And then:
    dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code does not add the new column "A_xtile". It just returns my data frame unchanged. If I first add a column called "A_xtile" full of dummy values, such as NaN, then it does successfully overwrite this column with the correct quintile markings.
But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.
Note that a simple apply will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.
Answer. Yes, you can add a new column at a specified position in a DataFrame by passing an index to the insert() method. By default, adding a column appends it as the last column of the DataFrame. Calling insert() with index 2, for example, places the new column at position 2 and fills it with the values passed as data.
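As a minimal sketch of the insert() call described above (the frame, the column names, and the `data` list are made up for illustration and are not from the original post):

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
data = [10, 20, 30]

# insert(loc, column, value) modifies df in place and returns None.
# loc=2 places the new column at position 2 (zero-based), before "c".
df.insert(2, "new_col", data)

columns_after_insert = list(df.columns)
print(columns_after_insert)
```

Plain assignment such as `df["new_col"] = data` would instead append the column at the end.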
Using the apply() method

If you need to apply a function over an existing column to compute values that will then be added as a new column in the existing DataFrame, the pandas.DataFrame.apply() method should do the trick.
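A small sketch of that pattern, assuming a made-up frame with columns `x` and `y` (none of these names come from the original post). It shows both the element-wise form on a single column and the row-wise form with `axis=1`:

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Element-wise: Series.apply calls the function once per value of column "x".
df["x_squared"] = df["x"].apply(lambda v: v ** 2)

# Row-wise: DataFrame.apply with axis=1 passes each row as a Series,
# so the function can combine several columns.
df["total"] = df.apply(lambda row: row["x"] + row["y"], axis=1)

print(df)
```

For simple arithmetic like this, vectorised expressions (`df["x"] ** 2`) are usually faster; apply() is mainly useful when the function cannot be expressed that way.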
What problems are you running into with apply? It works for this toy example here and the group lengths are different:
    In [82]: df
    Out[82]:
       X         Y
    0  0 -0.631214
    1  0  0.783142
    2  0  0.526045
    3  1 -1.750058
    4  1  1.163868
    5  1  1.625538
    6  1  0.076105
    7  2  0.183492
    8  2  0.541400
    9  2 -0.672809

    In [83]: def func(x):
       ....:     x['NewCol'] = np.nan
       ....:     return x
       ....:

    In [84]: df.groupby('X').apply(func)
    Out[84]:
       X         Y  NewCol
    0  0 -0.631214     NaN
    1  0  0.783142     NaN
    2  0  0.526045     NaN
    3  1 -1.750058     NaN
    4  1  1.163868     NaN
    5  1  1.625538     NaN
    6  1  0.076105     NaN
    7  2  0.183492     NaN
    8  2  0.541400     NaN
    9  2 -0.672809     NaN
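For the quintile case in the question, another possible approach is to transform only the input column within each group and assign the result to the new column, so nothing has to be pre-created. The sketch below is an assumption-laden illustration: `mark_quintiles` here is a hypothetical stand-in built on `np.quantile` and `np.searchsorted`, not the asker's function and not `st.mstats.mquantiles`, and the seeded random generator is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd

def mark_quintiles(x, breakpoints):
    """Hypothetical stand-in for the asker's function.

    Returns, for each entry of x, the index of the first breakpoint
    quantile that is >= that entry (0 for the lowest bucket).
    """
    x = np.asarray(x, dtype=float)
    cuts = np.quantile(x, breakpoints)  # one cut value per breakpoint
    buckets = np.searchsorted(cuts, x, side="left")
    # Guard against an index one past the last bucket.
    return np.clip(buckets, 0, len(cuts) - 1)

rng = np.random.default_rng(0)  # seeded only for repeatability
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})

breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# Select column "A" inside each "C" group, compute the bucket per group,
# and assign the aligned result back as a new column.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: pd.Series(mark_quintiles(s.to_numpy(), breaks), index=s.index)
)

print(dfrm.head())
```

Because transform returns a result aligned to the original index, the assignment works even though the groups have different sizes.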