
Pandas: Group by combination of two columns in Pandas 0.23.4

I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The objective of that post is to group by the combination of two columns regardless of their order, so that ('a', 'b') and ('b', 'a') fall into the same group, and to build a dictionary of value counts for each group.

Here's the accepted answer:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Here, ...apply(sorted) throws the following exception:

    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable

Here's my pandas version:

> pd.__version__
Out: '0.23.4'

Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html:

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Unfortunately, this also throws an error:

1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'

Expected output:

        score           count
x   y                     
a   b   {1: 1, 3: 2}      2
    c   {2: 1}            1 

Can someone please help me? On a side note, it would be great if you could also show how to compute the number of keys in each dictionary of the score column. I am looking for a vectorized solution.

I am using Python 3.6.7.

Many thanks.

watchtower asked Dec 06 '25 18:12


2 Answers

The problem is that sorted returns a list, so it is necessary to convert it to a Series:

d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)

But it is faster to use numpy.sort with the DataFrame constructor, because apply loops over the rows under the hood:

import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

# Sort each row's (x, y) pair so that ('b', 'a') becomes ('a', 'b')
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
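As a rough check that the two row-wise sorts are interchangeable, the sketch below builds both results on the question's sample data and compares them. The names via_apply, via_numpy and same are only illustrative, and .values is used so the snippet should also run on older pandas releases:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Row-wise sort via apply: one Python-level call to sorted() per row.
via_apply = d[['x', 'y']].apply(
    lambda row: pd.Series(sorted(row), index=['x', 'y']), axis=1)

# Row-wise sort via NumPy: a single call on the underlying 2-D array.
via_numpy = pd.DataFrame(np.sort(d[['x', 'y']].values, axis=1),
                         index=d.index, columns=['x', 'y'])

# Compare the plain Python values of both results.
same = via_apply.values.tolist() == via_numpy.values.tolist()
print(same)
print(via_numpy.values.tolist())
```

Both variants turn the ('b', 'a') rows into ('a', 'b'), so either can feed the groupby.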

Then select the column to aggregate and pass a list of aggregation functions, e.g. nunique for the number of unique values:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y                       
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1

Or count the rows in each group with DataFrameGroupBy.size (this counts duplicates too, hence 3 rather than 2 for the first group):

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y                    
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
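
To get the column names from the question's expected output (score and count), one option is to relabel the aggregated columns afterwards. This is only a sketch on the sample data. It assigns the new names positionally, which assumes the order of the functions passed to agg:

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Normalise each (x, y) pair so the grouping ignores the column order.
d[['x', 'y']] = np.sort(d[['x', 'y']].values, axis=1)

# Counter builds the value -> frequency mapping; nunique counts its keys.
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])

# Relabel positionally: first the Counter column, then the nunique column.
x.columns = ['score', 'count']
print(x)
```

Here count is 2 for the (a, b) group because the scores 1, 3, 3 contain two distinct values, which matches the number of keys in that group's dictionary.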
jezrael answered Dec 09 '25 14:12


Sort the values of the two columns row-wise with NumPy, then assign them back:

a=d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Output

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}
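
If overwriting x and y in the original frame is not wanted, a possible variation is to pass the sorted arrays to groupby directly as grouping keys. This is a sketch under that assumption rather than part of the answer above. The variable keys is an illustrative name, and the index names are set by hand because array keys carry none:

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Sorted copy of the two key columns; d itself is left untouched.
keys = np.sort(d[['x', 'y']].values, axis=1)

# groupby accepts arrays of the same length as the frame as grouping keys.
x = d.groupby([keys[:, 0], keys[:, 1]])['score'].agg(Counter)
x.index.names = ['x', 'y']
print(x)
```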
Vivek Kalyanarangan answered Dec 09 '25 14:12


