Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating new column in Pandas dataframe

Tags:

python

pandas

While there are some similar questions, I can't find a straightforward answer to the following. Note that I am coming from R, and quite new to Pandas.

Say I have a Pandas dataframe, df, that contains two columns: "measure" (unicode with 3 levels) and "Airquality" (numpy.float64).

I want to create a third column named "color", that is based on values in "Airquality". Further, I want to do this separately for each level of "measure". I have succeeded by splitting the df on "measure" using df.loc. I then calculated "color" separately in each df using the following code:

#calculate the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()

#initialize the column for "color" in df for each "measure" level:
df['color'] = None

#find the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()

#loop through the rows calculating and assigning the value for color,
#again, in df for each "measure" level
for i in range(len(df['Airquality'])):
    df['color'][i] = int(100*df['Airquality'][i]/maxi)]

However, this runs quite slowly with the large dataset I'm working with, and I'm sure there must be a much better way...probably using some Pandas function and likely without splitting the df into three, one for each "measure" level. Posting this in the hopes of learning from one of the many Python geniuses.

like image 410
davo1979 Avatar asked Jan 25 '17 01:01

davo1979


2 Answers

I'm hardly a genius, but I'd go with pandas apply. Usage i.e. as such:

df['newcol'] = df.apply(lambda row: row['firstcolval'] * row['secondcolval'], axis=1)

More info in the docs as usual: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

like image 117
wanaryytel Avatar answered Oct 21 '22 08:10

wanaryytel


I think you can use the groupby tools, in particular transform. Starting from a frame (BTW, it's considered customary to present an example dataframe yourself):

In [21]: df = pd.DataFrame({"measure": ["a","a","b","a","c","c"],
    ...:                    "aq": [10,20,30,20,30,50]})

In [22]: df["colour"] = (100.0 * df["aq"] / 
                         df.groupby("measure")["aq"].transform(max))

In [23]: df
Out[23]: 
   aq measure  colour
0  10       a    50.0
1  20       a   100.0
2  30       b   100.0
3  20       a   100.0
4  30       c    60.0
5  50       c   100.0

which works because we get the right denominator by grouping on the measure column, finding the maximum of the aq column for each different value of measure, and broadcasting it up to the whole frame, which is what this does:

In [24]: df.groupby("measure")["aq"].transform(max)
Out[24]: 
0    20
1    20
2    30
3    20
4    50
5    50
Name: aq, dtype: int64
like image 23
DSM Avatar answered Oct 21 '22 08:10

DSM