Calculating new column in Pandas dataframe

Question

While there are some similar questions, I can't find a straightforward answer to the following. Note that I am coming from R, and quite new to Pandas.

Say I have a Pandas dataframe, df, that contains two columns: "measure" (unicode with 3 levels) and "Airquality" (numpy.float64).

I want to create a third column named "color", that is based on values in "Airquality". Further, I want to do this separately for each level of "measure". I have succeeded by splitting the df on "measure" using df.loc. I then calculated "color" separately in each df using the following code:

#calculate the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()

#initialize the column for "color" in df for each "measure" level:
df['color'] = None

#find the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()

#loop through the rows calculating and assigning the value for color,
#again, in df for each "measure" level
for i in range(len(df['Airquality'])):
    df['color'][i] = int(100*df['Airquality'][i]/maxi)]

However, this runs quite slowly with the large dataset I'm working with, and I'm sure there must be a much better way...probably using some Pandas function and likely without splitting the df into three, one for each "measure" level. Posting this in the hopes of learning from one of the many Python geniuses.

wanaryytel · Accepted Answer

I'm hardly a genius, but I'd go with pandas apply. Usage i.e. as such:

df['newcol'] = df.apply(lambda row: row['firstcolval'] * row['secondcolval'], axis=1)

More info in the docs as usual: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

DSM · Answer

I think you can use the groupby tools, in particular transform. Starting from a frame (BTW, it's considered customary to present an example dataframe yourself):

In [21]: df = pd.DataFrame({"measure": ["a","a","b","a","c","c"],
    ...:                    "aq": [10,20,30,20,30,50]})

In [22]: df["colour"] = (100.0 * df["aq"] / 
                         df.groupby("measure")["aq"].transform(max))

In [23]: df
Out[23]: 
   aq measure  colour
0  10       a    50.0
1  20       a   100.0
2  30       b   100.0
3  20       a   100.0
4  30       c    60.0
5  50       c   100.0

which works because we get the right denominator by grouping on the measure column, finding the maximum of the aq column for each different value of measure, and broadcasting it up to the whole frame, which is what this does:

In [24]: df.groupby("measure")["aq"].transform(max)
Out[24]: 
0    20
1    20
2    30
3    20
4    50
5    50
Name: aq, dtype: int64

Calculating new column in Pandas dataframe

Tags:

python

pandas

davo1979

2 Answers

wanaryytel

DSM

Recent Activity

Donate For Us

Calculating new column in Pandas dataframe

Tags:

python

pandas

davo1979

2 Answers

wanaryytel

DSM

Related questions

Recent Activity

Donate For Us