I have the following dataframe named ttm:
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12 1 60 3 1728
1 11 1 240 3 1331
3 5 1 5 3 125
4 6 1 16 2 216
2 10 3 270 3 1000
5 8 3 18 2 512
When i do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):
clienthostid LoginDaysSum
0 1 4
1 3 2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
Thank you,
You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.
So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to dataframe. For your first dataframe, you run unequal groupings (or unequal index lengths) coercing a series return which in the "combine" processing does not adequately yield a data frame.
Returns a groupby object that contains information about the groups. Convenience method for frequency conversion and resampling of time series. See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Sort Values in Descending Order with Groupby You can sort values in descending order by using ascending=False param to sort_values() method. The head() function is used to get the first n rows. It is useful for quickly testing if your object has the right type of data in it.
For return DataFrame
after groupby
are 2 possible solutions:
parameter as_index=False
what works nice with count
, sum
, mean
functions
reset_index
for create new column from levels of index
, more general solution
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
For second need remove as_index=False
and instead add reset_index
:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
.apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
Why some columns are gone?
I think there can be problem automatic exclusion of nuisance columns:
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 1 60 3 1728
1 11aa 1 240 3 1331
3 5aa 1 5 3 125
4 6aa 1 16 2 216
2 10aa 3 270 3 1000
5 8aa 3 18 2 512
#removed str column userid
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1 321 11 3400
3 288 5 1512
What is the difference between size and count in pandas?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With