So I was trying to understand pandas.dataFrame.groupby() function and I came across this example on the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
Not to further explore I did this:
print(df.groupby('B').head())
it outputs the same dataFrame but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
What does this mean? In a normal dataFrame printing .head()
simply outputs the first 5 rows what's happening here?
And also why does printing .head()
gives the same output as the dataframe? Shouldn't it be grouped by the elements of the column 'B'
?
groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.
The pandas groupby is implemented in highly-optimized cython code, and provides a nice baseline of comparison for our exploration.
Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. Used to determine the groups for the groupby.
Groupby is a very popular function in Pandas. This is very good at summarising, transforming, filtering, and a few other very essential data analysis tasks.
When you use just
df.groupby('A')
You get a GroupBy
object. You haven't applied any function to it at that point. Under the hood, while this definition might not be perfect, you can think of a groupby
object as:
To illustrate:
df = DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')
# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
print(i)
(1, A B
0 1 1
1 1 2)
(2, A B
2 2 3
3 2 4)
# this version uses multiple counters
# in a single loop. each `group` is a group, each
# `df` is its corresponding DataFrame
for group, df in grouped:
print('group of A:', group, '\n')
print(df, '\n')
group of A: 1
A B
0 1 1
1 1 2
group of A: 2
A B
2 2 3
3 2 4
# and if you just wanted to visualize the groups,
# your second counter is a "throwaway"
for group, _ in grouped:
print('group of A:', group, '\n')
group of A: 1
group of A: 2
Now as for .head
. Just have a look at the docs for that method:
Essentially equivalent to
.apply(lambda x: x.head(n))
So here you're actually applying a function to each group of the groupby object. Keep in mind .head(5)
is applied to each group (each DataFrame), so because you have less than or equal to 5 rows per group, you get your original DataFrame.
Consider this with the example above. If you use .head(1)
, you get only the first 1 row of each group:
print(df.groupby('A').head(1))
A B
0 1 1
2 2 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With