While there are some similar questions, I can't find a straightforward answer to the following. Note that I am coming from R, and quite new to Pandas.
Say I have a Pandas dataframe, df, that contains two columns: "measure" (unicode with 3 levels) and "Airquality" (numpy.float64).
I want to create a third column named "color", that is based on values in "Airquality". Further, I want to do this separately for each level of "measure". I have succeeded by splitting the df on "measure" using df.loc. I then calculated "color" separately in each df using the following code:
#calculate the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()
#initialize the column for "color" in df for each "measure" level:
df['color'] = None
#find the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()
#loop through the rows calculating and assigning the value for color,
#again, in df for each "measure" level
for i in range(len(df['Airquality'])):
df['color'][i] = int(100*df['Airquality'][i]/maxi)]
However, this runs quite slowly with the large dataset I'm working with, and I'm sure there must be a much better way...probably using some Pandas function and likely without splitting the df into three, one for each "measure" level. Posting this in the hopes of learning from one of the many Python geniuses.
I'm hardly a genius, but I'd go with pandas apply
. Usage i.e. as such:
df['newcol'] = df.apply(lambda row: row['firstcolval'] * row['secondcolval'], axis=1)
More info in the docs as usual: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
I think you can use the groupby
tools, in particular transform
. Starting from a frame (BTW, it's considered customary to present an example dataframe yourself):
In [21]: df = pd.DataFrame({"measure": ["a","a","b","a","c","c"],
...: "aq": [10,20,30,20,30,50]})
In [22]: df["colour"] = (100.0 * df["aq"] /
df.groupby("measure")["aq"].transform(max))
In [23]: df
Out[23]:
aq measure colour
0 10 a 50.0
1 20 a 100.0
2 30 b 100.0
3 20 a 100.0
4 30 c 60.0
5 50 c 100.0
which works because we get the right denominator by grouping on the measure column, finding the maximum of the aq column for each different value of measure, and broadcasting it up to the whole frame, which is what this does:
In [24]: df.groupby("measure")["aq"].transform(max)
Out[24]:
0 20
1 20
2 30
3 20
4 50
5 50
Name: aq, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With