Example dataset:
>>> df
    ID     Region  count
0  100       Asia      2
1  101     Europe      3
2  102         US      1
3  103     Africa      5
4  100     Russia      5
5  101  Australia      7
6  102         US      8
7  104       Asia     10
8  105     Europe     11
9  110     Africa     23
I want to group the observations of this dataset by ID and Region and sum the count for each group. So I used something like this...
>>> print(df.groupby(['ID', 'Region'], as_index=False)['count'].sum())
    ID     Region  count
0  100       Asia      2
1  100     Russia      5
2  101  Australia      7
3  101     Europe      3
4  102         US      9
5  103     Africa      5
6  104       Asia     10
7  105     Europe     11
8  110     Africa     23
By using as_index=False I am able to get "SQL-like" output. My problem is that I am unable to rename the aggregate variable count here. In SQL, if I wanted to do the above, I would write something like this:
select ID, Region, sum(count) as Total_Numbers from df group by ID, Region order by ID, Region
As we can see, it's very easy to rename the aggregate variable count to Total_Numbers in SQL. I wanted to do the same thing in pandas but was unable to find such an option in the groupby function. Can somebody help?
The second question (more of an observation) is whether...
I understand that the variable names are strings, so they have to be inside quotes, but I see that if we use them outside a DataFrame function, as an attribute, we don't need the quotes, as in df.ID.sum(). It's only when we use them in a DataFrame function like df.sort() or df.groupby that we have to put them inside quotes. This is a bit of a pain, since in SQL, SAS, and other languages we simply use the variable name without quoting it. Any suggestions on this?
Kindly reply to both questions (Q1 is the main, Q2 more of an opinion).
Note that pandas can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language, and they can also cause unanticipated and sometimes difficult-to-debug problems in Python.
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T; this removes all columns that hold the same data, regardless of their names.
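As a rough illustration of the two kinds of duplication mentioned above, the sketch below uses small made-up frames (the column names a, b, c, x, y are placeholders, not from the question). It shows the transpose trick for columns with identical data, and, as an assumption about what is usually wanted for repeated labels, a filter based on Index.duplicated().

```python
import pandas as pd

# Hypothetical frame: columns "a" and "b" have different names but identical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})

# Transposing turns columns into rows, so drop_duplicates() can discard
# the repeated ones; transposing back restores the original orientation.
deduped = df.T.drop_duplicates().T

# Hypothetical frame whose column *labels* repeat, regardless of the data.
dup_labels = pd.DataFrame([[1, 2, 3]], columns=["x", "x", "y"])

# Keep only the first occurrence of each label.
unique_labels = dup_labels.loc[:, ~dup_labels.columns.duplicated()]

print(list(deduped.columns))
print(list(unique_labels.columns))
```

The transpose approach compares values, so it can be slow on wide frames and may change dtypes when columns have mixed types; the label-based filter only looks at names.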
For the first question, I think the answer would be:
<your DataFrame>.rename(columns={'count':'Total_Numbers'})
or
<your DataFrame>.columns = ['ID', 'Region', 'Total_Numbers']
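A minimal sketch of the difference between the two options, using a small hand-made frame that stands in for an already aggregated result (the values are placeholders). rename returns a new DataFrame and matches by label, while assigning to .columns changes the frame in place and matches by position.

```python
import pandas as pd

# Hypothetical, already aggregated result standing in for <your DataFrame>.
totals = pd.DataFrame(
    {"ID": [100, 101], "Region": ["Asia", "Europe"], "count": [2, 3]}
)

# Option 1: rename by label. Returns a new frame; `totals` is left unchanged.
renamed = totals.rename(columns={"count": "Total_Numbers"})

# Option 2: overwrite all labels by position. This mutates the frame,
# so work on a copy here to keep `totals` intact for comparison.
relabelled = totals.copy()
relabelled.columns = ["ID", "Region", "Total_Numbers"]

print(list(totals.columns))
print(list(renamed.columns))
print(list(relabelled.columns))
```

The positional form is shorter, but it silently mislabels data if the column order ever changes, so the label-based rename is probably the safer default.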
As for the second one, I'd say the answer is no. It's possible to write df.ID because of the Python data model:
Attribute references are translated to lookups in this dictionary, e.g., m.x is equivalent to m.__dict__["x"]
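To make the attribute-versus-string point concrete, here is a small sketch on a made-up two-row frame. It also shows one caveat that applies to the question's own data: a column called count collides with the DataFrame.count method, so attribute access does not reach the column and the quoted form is needed.

```python
import pandas as pd

# Hypothetical two-row frame reusing the question's column names.
df = pd.DataFrame({"ID": [100, 101], "count": [2, 3]})

# Attribute access is a convenience for column labels that are valid
# Python identifiers and do not clash with existing attributes.
by_attribute = df.ID.sum()
by_label = df["ID"].sum()

# "count" is also a DataFrame method, so df.count is the method,
# not the column. Bracket access with the quoted name always works.
count_is_method = callable(df.count)
count_total = df["count"].sum()

print(by_attribute, by_label, count_is_method, count_total)
```

Functions such as groupby take the label as an ordinary argument, which is why they need the string: a bare ID would be looked up as a Python variable.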
The current (as of version 0.20) method for changing column names after a groupby operation is to chain the rename method. See this deprecation note in the documentation for more detail.
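Applied to the data from the question, the chained form might look like the sketch below. It mirrors the SQL statement step by step; the explicit sort_values call is an assumption added to match the ORDER BY clause, since groupby already sorts the group keys by default.

```python
import pandas as pd

# The example dataset from the question.
df = pd.DataFrame({
    "ID": [100, 101, 102, 103, 100, 101, 102, 104, 105, 110],
    "Region": ["Asia", "Europe", "US", "Africa", "Russia",
               "Australia", "US", "Asia", "Europe", "Africa"],
    "count": [2, 3, 1, 5, 5, 7, 8, 10, 11, 23],
})

result = (
    df.groupby(["ID", "Region"], as_index=False)["count"]  # GROUP BY ID, Region
      .sum()                                               # SUM(count)
      .rename(columns={"count": "Total_Numbers"})          # AS Total_Numbers
      .sort_values(["ID", "Region"])                       # ORDER BY ID, Region
)

print(result)
```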
This is the first result on Google, and although the top answer works, it does not really answer the question. There is a better answer here and a long discussion on GitHub about the full functionality of passing dictionaries to the agg method.
These answers unfortunately do not exist in the documentation, but the general format for grouping, aggregating, and then renaming columns uses a dictionary of dictionaries. The keys of the outer dictionary are the names of the columns to be aggregated. The inner dictionaries have the new column names as keys and the aggregating functions as values.
Before we get there, let's create a four column DataFrame.
import numpy as np
import pandas as pd

# C and D are random, so the exact values will differ from run to run.
df = pd.DataFrame({'A': list('wwwwxxxx'),
                   'B': list('yyzzyyzz'),
                   'C': np.random.rand(8),
                   'D': np.random.rand(8)})

   A  B         C         D
0  w  y  0.643784  0.828486
1  w  y  0.308682  0.994078
2  w  z  0.518000  0.725663
3  w  z  0.486656  0.259547
4  x  y  0.089913  0.238452
5  x  y  0.688177  0.753107
6  x  z  0.955035  0.462677
7  x  z  0.892066  0.368850
Let's say we want to group by columns A and B, aggregate column C with mean and median, and aggregate column D with max. The following code would do this.
df.groupby(['A', 'B']).agg({'C': ['mean', 'median'], 'D': 'max'})

            D         C
          max      mean    median
A B
w y  0.994078  0.476233  0.476233
  z  0.725663  0.502328  0.502328
x y  0.753107  0.389045  0.389045
  z  0.462677  0.923551  0.923551
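This returns a DataFrame with hierarchical column labels. The original question asked about renaming the columns in the same step. This is possible using a dictionary of dictionaries. Note that this nested-dictionary form is the one deprecated in version 0.20; it was removed in pandas 1.0, where it raises a SpecificationError.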
df.groupby(['A', 'B']).agg({'C': {'C_mean': 'mean', 'C_median': 'median'},
                            'D': {'D_max': 'max'}})

            D         C
        D_max    C_mean  C_median
A B
w y  0.994078  0.476233  0.476233
  z  0.725663  0.502328  0.502328
x y  0.753107  0.389045  0.389045
  z  0.462677  0.923551  0.923551
This renames the columns all in one go but still leaves the hierarchical column index, whose top level can be dropped with df.columns = df.columns.droplevel(0).
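For readers on newer pandas, two alternatives that are not part of the original answer may be worth sketching. The first uses named aggregation, which was added in pandas 0.25 and produces flat, renamed columns in a single step. The second keeps the list-style agg call and flattens the resulting two-level columns by joining the levels. The fixed random seed is an assumption made only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Same shape as the four-column frame above; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": list("wwwwxxxx"),
    "B": list("yyzzyyzz"),
    "C": rng.random(8),
    "D": rng.random(8),
})

# Alternative 1: named aggregation (pandas >= 0.25).
# Each keyword is the new column name; the tuple is (source column, function).
summary = df.groupby(["A", "B"]).agg(
    C_mean=("C", "mean"),
    C_median=("C", "median"),
    D_max=("D", "max"),
)

# Alternative 2: aggregate with lists, then flatten the two-level columns
# by joining each (column, function) pair with an underscore.
flattened = df.groupby(["A", "B"]).agg({"C": ["mean", "median"], "D": ["max"]})
flattened.columns = ["_".join(pair) for pair in flattened.columns]

print(summary)
print(flattened)
```

Both results have a single level of column labels, so no droplevel step is needed afterwards. Adding .reset_index() would turn the A and B group keys back into ordinary columns.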