Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by index + column in pandas

Tags:

python

pandas

I have a dataframe that has two columns, user_id and item_bought. Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item wise count for the user.

How do I do that?

like image 882
vumaasha Avatar asked Jun 18 '15 20:06

vumaasha


People also ask

Can you Groupby index in Pandas?

How to perform groupby index in pandas? Pass index name of the DataFrame as a parameter to groupby() function to group rows on an index. DataFrame. groupby() function takes string or list as a param to specify the group columns or index.

How do I group values in a column in Pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.

How do you get index after Groupby Pandas?

Overview: Create a dataframe using an dictionary. Group by item_id and find the max value. enumerate over the grouped dataframe and use the key which is an numeric value to return the alpha index value. Create an result_df dataframe if you desire.

Does Groupby preserve index?

The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.


2 Answers

From version 0.20.1 it is simplier:

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]  index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])  df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],                    'B': np.arange(8)}, index=index)  print (df)                A  B first second       bar   one     1  0       two     1  1 baz   one     1  2       two     1  3 foo   one     2  4       two     2  5 qux   one     3  6       two     3  7  print (df.groupby(['second', 'A']).sum())           B second A    one    1  2        2  4        3  6 two    1  4        2  5        3  7 
like image 149
jezrael Avatar answered Sep 25 '22 09:09

jezrael


this should work:

>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2']) >>> df['ind1'] = list('AAABCC') >>> df['ind2'] = range(6) >>> df.set_index(['ind1','ind2'], inplace=True) >>> df             col1  col2 ind1 ind2             A    0        3     2      1        2     0      2        2     3 B    3        2     4 C    4        3     1      5        0     0   >>> df.groupby([df.index.get_level_values(0),'col1']).count()             col2 ind1 col1       A    2        2      3        1 B    2        1 C    0        1      3        1 

I had the same problem using one of the columns from multiindex. with multiindex, you cannot use df.index.levels[0] since it has only distinct values from that particular index level and will be most likely of different size than whole dataframe...

check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html - get_level_values "Return vector of label values for requested level, equal to the length of the index"

like image 41
kekert Avatar answered Sep 25 '22 09:09

kekert