Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Pandas, after groupby the grouped column is gone

Tags:

python

pandas

I have the following dataframe named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60              3           1728
1       11          1               240             3           1331
3       5           1               5               3           125
4       6           1               16              2           216
2       10          3               270             3           1000
5       8           3               18              2           512

When i do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):

       clienthostid  LoginDaysSum
0             1          4
1             3          2

But when I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

I get:

0    1.0
1    1.5
  1. Why did the labels go? I still also need the grouped need the 'clienthostid' and I need also the results of the apply to be under a label too
  2. Sometimes when I do groupby some of the other columns still appear, why is that that sometimes columns disappear and sometime stays? is there a flag I'm missing that do those stuff?
  3. In the example that I gave, when I did count the results showed on label 'LoginDaysSum', is there a why to add a new label for the results instead?

Thank you,

like image 515
O. San Avatar asked Jan 15 '17 06:01

O. San


People also ask

How do I retrieve a column in Pandas?

You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.

Does groupby return a Dataframe or series?

So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to dataframe. For your first dataframe, you run unequal groupings (or unequal index lengths) coercing a series return which in the "combine" processing does not adequately yield a data frame.

What does PD groupby return?

Returns a groupby object that contains information about the groups. Convenience method for frequency conversion and resampling of time series. See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

How do you sort values after groupby?

Sort Values in Descending Order with Groupby You can sort values in descending order by using ascending=False param to sort_values() method. The head() function is used to get the first n rows. It is useful for quickly testing if your object has the right type of data in it.


1 Answers

For return DataFrame after groupby are 2 possible solutions:

  1. parameter as_index=False what works nice with count, sum, mean functions

  2. reset_index for create new column from levels of index, more general solution

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2

For second need remove as_index=False and instead add reset_index:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class 'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
0             1    1.0
1             3    1.5

Why some columns are gone?

I think there can be problem automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512

#removed str column userid
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1                       321            11   3400
3                       288             5   1512

What is the difference between size and count in pandas?

like image 172
jezrael Avatar answered Oct 02 '22 18:10

jezrael