using groupby/aggregate to return multiple columns

Tags: python, pandas
I have an example dataset that I want to group by one column, producing 4 new columns based on all of the values of the existing columns.

Here is some sample data:

import numpy as np
import pandas as pd

nan = np.nan  # the dict below uses a bare nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
  1: u'ENSMUST00000000001.4-1',
  2: u'ENSMUST00000000003.13-0',
  3: u'ENSMUST00000000003.13-0',
  4: u'ENSMUST00000000003.13-0'},
 'name': {0: u'NonCodingDeletion',
  1: u'NonCodingInsertion',
  2: u'CodingDeletion',
  3: u'CodingInsertion',
  4: u'NonCodingDeletion'},
 'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
 'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)

Which looks like this:

               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN

I want to return booleans based on the presence/absence of values in the name column depending on whether the value_CDS contains only null values. I produced this function to do so:

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

And did this:

merged = df.groupby('AlignmentId').aggregate(aggfunc)

Which gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3).

How can I return multiple new columns from a groupby-aggregate?

The output I am looking for is:

ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)

Which I would then ideally put into a 5-column dataframe.

If I use .apply, the output is incorrect:

ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0    (False, False, False, False)

But if I grab the groups one at a time, it is correct:

In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print aggfunc(d)
   .....:
(False, False, False, False)
(True, True, True, False)

asked Aug 18 '17 04:08

Ian Fiddes

1 Answer

You need to change s.name to s['name'], because inside the applied function .name returns the name of the group (the value of the column being grouped by), not the name column:

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])

    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object

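Printing both inside the function shows the difference:

def aggfunc(s):
    # s.name is the group key; s['name'] is the column called "name"
    print ('Name of group is: {}'.format((s.name)))  
    print ('Column name is:\n {}'.format(s['name']))  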


merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
 2       CodingDeletion
3      CodingInsertion
4    NonCodingDeletion
Name: name, dtype: object
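This likely also explains why the loop in the question gives the right answer while .apply does not: when iterating over the groups, each piece is an ordinary DataFrame, so .name falls through to the column, whereas .apply attaches the group key as .name first. The sketch below checks this on a small made-up frame (the short ids "a" and "b" and the variable names loop_types and apply_names are illustrative, not from the question):

```python
import pandas as pd

# Small stand-in for the question's data (ids shortened for illustration).
df = pd.DataFrame({
    "AlignmentId": ["a", "a", "b"],
    "name": ["NonCodingDeletion", "NonCodingInsertion", "CodingDeletion"],
})

# Iterating over the groups: .name resolves to the "name" column (a Series).
loop_types = [type(d.name).__name__ for _, d in df.groupby("AlignmentId")]

# Inside apply: .name is the group key, which shadows the column.
apply_names = df.groupby("AlignmentId").apply(lambda s: s.name)

print(loop_types)
print(apply_names.tolist())
```

With the group key in hand, set(s.name) becomes a set of the characters of the id string, so none of the membership tests can match.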

Improved code:

def aggfunc(s):
    # the if and else branches built the same set, so the test is dropped
    c = set(s['name'])

    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

                          col1   col2   col3   col4
AlignmentId                                        
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
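To get the 5-column dataframe mentioned in the question, one option is to call reset_index() on the result so that AlignmentId becomes a regular column. A minimal sketch, assuming the same aggfunc as above and the sample data trimmed to the two columns it uses (the variable name flat is only illustrative):

```python
import pandas as pd

# Sample data from the question, reduced to the columns aggfunc needs.
df = pd.DataFrame({
    "AlignmentId": ["ENSMUST00000000001.4-1"] * 2
                   + ["ENSMUST00000000003.13-0"] * 3,
    "name": ["NonCodingDeletion", "NonCodingInsertion",
             "CodingDeletion", "CodingInsertion", "NonCodingDeletion"],
})

def aggfunc(s):
    # Use s['name'] (the column), not s.name (the group key).
    c = set(s["name"])
    cols = ["col1", "col2", "col3", "col4"]
    return pd.Series(
        ("CodingDeletion" in c or "CodingInsertion" in c,
         "CodingInsertion" in c,
         "CodingDeletion" in c,
         "CodingMult3Deletion" in c or "CodingMult3Insertion" in c),
        index=cols,
    )

merged = df.groupby("AlignmentId").apply(aggfunc)

# Move AlignmentId out of the index: one id column plus four flag columns.
flat = merged.reset_index()
print(flat)
```

Recent pandas versions may emit a warning about apply and grouping columns here; it does not affect the result in this sketch, since aggfunc only reads the name column.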

answered Sep 18 '22 16:09

jezrael
