using groupby/aggregate to return multiple columns

Tags: python, pandas

I have an example dataset that I want to group by one column and then produce 4 new columns based on all of the values of existing columns.

Here is some sample data:

import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
  1: u'ENSMUST00000000001.4-1',
  2: u'ENSMUST00000000003.13-0',
  3: u'ENSMUST00000000003.13-0',
  4: u'ENSMUST00000000003.13-0'},
 'name': {0: u'NonCodingDeletion',
  1: u'NonCodingInsertion',
  2: u'CodingDeletion',
  3: u'CodingInsertion',
  4: u'NonCodingDeletion'},
 'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
 'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)

Which looks like this:

               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN

I want to return booleans indicating whether particular values are present in the name column, depending on whether value_CDS contains only null values. I wrote this function to do so:

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

And did this:

merged = df.groupby('AlignmentId').aggregate(aggfunc)

Which gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3).

How can I return multiple new columns from a groupby-aggregate?

The output I am looking for is:

ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)

Which I would then ideally put into a 5-column dataframe.

If I use .apply, the output is incorrect:

ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0    (False, False, False, False)

But if I grab the groups one at a time, it is correct:

In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print aggfunc(d)
   .....:
(False, False, False, False)
(True, True, True, False)

Ian Fiddes asked Aug 18 '17 04:08


1 Answer

You need to change s.name to s['name'], because inside groupby.apply, .name returns the name of the group (the value of the column being grouped by), not the name column:

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])

    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object

Printing both inside the function shows the difference (older pandas versions call the function twice on the first group, which is why it is printed twice):

def aggfunc(s):

    print ('Name of group is: {}'.format((s.name)))  
    print ('Column name is:\n {}'.format(s['name']))  


merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
 2       CodingDeletion
3      CodingInsertion
4    NonCodingDeletion
Name: name, dtype: object
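
As a rough illustration of why the original function returned all False under apply, here is a minimal sketch in plain Python. The two variables are hypothetical stand-ins for what s.name and s['name'] evaluated to in the output above: calling set() on the group key, a string, yields its individual characters, so no full label can ever be a member.

```python
# Hypothetical stand-ins for the two expressions inside the applied function
group_key = "ENSMUST00000000003.13-0"  # what s.name evaluated to under apply
name_column = ["CodingDeletion", "CodingInsertion", "NonCodingDeletion"]  # s['name']

chars = set(group_key)     # a set of single characters such as 'E', 'N', '0'
labels = set(name_column)  # a set of whole labels

key_has_label = "CodingDeletion" in chars      # False: only characters are members
column_has_label = "CodingDeletion" in labels  # True

print(key_has_label, column_has_label)
```

In the question's plain for loop the group has no name attribute set by pandas, so d.name falls back to the name column, which is why that version looked correct.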

Improved code:

def aggfunc(s):
    # the if and else branches built the same set, so the test is omitted
    c = set(s['name'])

    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c,
                      'CodingInsertion' in c, 'CodingDeletion' in c,
                      'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

                          col1   col2   col3   col4
AlignmentId                                        
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
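
For the 5-column dataframe mentioned in the question, one possible follow-up is to call reset_index() so that AlignmentId becomes an ordinary column. The sketch below is self-contained and rebuilds the sample data; the descriptive column names (coding_indel and so on) are assumptions chosen for illustration, not names from the question, and selecting [['name']] before apply is a choice made here to keep the grouping column out of each group.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt as plain lists
df = pd.DataFrame({
    'AlignmentId': ['ENSMUST00000000001.4-1', 'ENSMUST00000000001.4-1',
                    'ENSMUST00000000003.13-0', 'ENSMUST00000000003.13-0',
                    'ENSMUST00000000003.13-0'],
    'name': ['NonCodingDeletion', 'NonCodingInsertion', 'CodingDeletion',
             'CodingInsertion', 'NonCodingDeletion'],
    'value_CDS': [np.nan, np.nan, 1.0, 1.0, np.nan],
    'value_mRNA': [21.0, 26.0, 1.0, 1.0, 2.0],
})

def aggfunc(s):
    # s['name'] is the column; s.name would be the group key under apply
    c = set(s['name'])
    # Keys are hypothetical column names; each one becomes an output column
    return pd.Series({
        'coding_indel': 'CodingDeletion' in c or 'CodingInsertion' in c,
        'coding_insertion': 'CodingInsertion' in c,
        'coding_deletion': 'CodingDeletion' in c,
        'coding_mult3_indel': 'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c,
    })

# Only the 'name' column is passed to aggfunc; reset_index() then turns
# the AlignmentId index into a regular column
merged = df.groupby('AlignmentId')[['name']].apply(aggfunc).reset_index()
print(merged)
```

The result has one row per AlignmentId and five columns: AlignmentId plus the four boolean flags.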

jezrael answered Sep 18 '22 16:09