Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation

I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np
dfrm = pandas.DataFrame({'A': np.random.rand(100),
                         'B': (50 + np.random.randn(100)),
                         'C': np.random.randint(low=0, high=3, size=(100,))})

And let's say I write my own function to compute the quintile of each element in an array. I have my own function for this, but for the sake of example just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st
def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # This returns an array the same shape as x, with an integer for which
    # breakpoint-bucket that entry of x falls into.
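For readers who want something concrete to run, here is a minimal sketch of what such a function might look like. It is not the asker's implementation: it uses np.quantile as a stand-in for st.mstats.mquantiles (the two use different interpolation defaults, so bucket edges can differ slightly), and the choice of np.searchsorted for bucketing is an assumption.

```python
import numpy as np

def mark_quintiles(x, breakpoints):
    """Hypothetical stand-in: label each entry of x with its bucket index."""
    x = np.asarray(x, dtype=float)
    # Empirical quantile values of x at the requested probabilities.
    # (np.quantile is swapped in here for st.mstats.mquantiles.)
    edges = np.quantile(x, breakpoints)
    # side="left": a value equal to an edge is placed in the bucket
    # whose upper edge it matches.
    buckets = np.searchsorted(edges, x, side="left")
    # Guard: keep labels inside the valid range if the last breakpoint is < 1.0.
    return np.minimum(buckets, len(edges) - 1)
```

With breakpoints [0.2, 0.4, 0.6, 0.8, 1.0], the labels run from 0 (lowest fifth) to 4 (highest fifth), and the result has the same shape as x.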

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values,
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code does not add the new column "A_xtile"; it just returns my data frame unchanged. If I first add a column of dummy values (e.g. NaN) called "A_xtile", then it does successfully overwrite that column with the correct quintile markings.

But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.

Note that a simple apply will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

asked Sep 12 '12 13:09

ely

1 Answer

What problems are you running into with apply? It works for this toy example here and the group lengths are different:

In [82]: df
Out[82]:
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....:

In [84]: df.groupby('X').apply(func)
Out[84]:
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
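An alternative sketch, not taken from the answer above: rather than mutating each group inside apply, one can call transform on the single input column and assign the result to a new column. On a selected column, transform returns values aligned to the original index, so groups of different sizes are not a problem. The mark_quintiles body below is a hypothetical stand-in (np.quantile instead of st.mstats.mquantiles), and the seeded random data only mimics the question's dfrm.

```python
import numpy as np
import pandas as pd

def mark_quintiles(x, breakpoints):
    # Hypothetical stand-in for the asker's function: bucket index per entry.
    edges = np.quantile(x, breakpoints)
    return np.minimum(np.searchsorted(edges, x, side="left"), len(edges) - 1)

# Data shaped like the question's dfrm; the seed is only for repeatability.
rng = np.random.default_rng(0)
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})
breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# transform calls the function once per group of "C" and stitches the
# per-group results back together in the original row order.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: mark_quintiles(s.to_numpy(), breaks)
)
```

Because the assignment happens outside the groupby call, no placeholder column needs to exist beforehand. The exact behaviour of groupby(...).apply with a function that mutates its group has changed between pandas releases, which is one reason this sketch avoids it.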

answered Sep 29 '22 09:09

Chang She

Tags:

python

pandas

dataframe

group-by

transform
