I have an example dataset that I want to group by one column and then produce 4 new columns based on all of the values of the existing columns.
Here is some sample data:
import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
                        1: u'ENSMUST00000000001.4-1',
                        2: u'ENSMUST00000000003.13-0',
                        3: u'ENSMUST00000000003.13-0',
                        4: u'ENSMUST00000000003.13-0'},
        'name': {0: u'NonCodingDeletion',
                 1: u'NonCodingInsertion',
                 2: u'CodingDeletion',
                 3: u'CodingInsertion',
                 4: u'NonCodingDeletion'},
        'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
        'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)
Which looks like this:
               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN
I want to return booleans based on the presence/absence of values in the name column, depending on whether the value_CDS column contains only null values. I produced this function to do so:
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)
And did this:
merged = df.groupby('AlignmentId').aggregate(aggfunc)
Which gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3).
How can I return multiple new columns from a groupby-aggregate?
The output I am looking for is:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)
Which I would then ideally put into a 5-column dataframe.
If I use .apply, the output is incorrect:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (False, False, False, False)
But if I grab the groups one at a time, it is correct:
In [380]: for aln_id, d in df.groupby('AlignmentId'):
.....: print aggfunc(d)
.....:
(False, False, False, False)
(True, True, True, False)
To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I'd recommend flattening this after aggregating by renaming the new columns.
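As a rough illustration of the point above, here is a minimal sketch that reuses the numeric columns from the sample data. The choice of aggregation functions (sum, max, count) and the "column_function" naming scheme used for flattening are assumptions made for the example, not something taken from the thread.

```python
import numpy as np
import pandas as pd

# Numeric columns from the sample data in the question
df = pd.DataFrame({
    "AlignmentId": ["ENSMUST00000000001.4-1", "ENSMUST00000000001.4-1",
                    "ENSMUST00000000003.13-0", "ENSMUST00000000003.13-0",
                    "ENSMUST00000000003.13-0"],
    "value_CDS": [np.nan, np.nan, 1.0, 1.0, np.nan],
    "value_mRNA": [21.0, 26.0, 1.0, 1.0, 2.0],
})

# One key per column; a list of functions produces MultiIndex columns
# such as ('value_mRNA', 'sum').
summary = df.groupby("AlignmentId").agg({
    "value_mRNA": ["sum", "max"],
    "value_CDS": ["count"],
})

# Flatten the two column levels into single names, e.g. 'value_mRNA_sum'
summary.columns = ["_".join(col) for col in summary.columns]
print(summary)
```

Joining the levels with an underscore is only one convention; any renaming that leaves a flat column index makes later selection simpler.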
To group by multiple columns in a pandas DataFrame and compute multiple aggregations, pass a list of column names to groupby() and then use aggregate functions to apply one or several aggregations at the same time.
Using GroupBy on a pandas DataFrame is simple overall: first group the data by one or more columns, then apply an aggregation function such as min, max, sum, or mean.
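A small sketch of grouping by more than one column, again on the sample data. The helper column kind and the output names total_mRNA and n_rows are hypothetical, introduced only for this example; the named-aggregation form of agg() assumes pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "AlignmentId": ["ENSMUST00000000001.4-1", "ENSMUST00000000001.4-1",
                    "ENSMUST00000000003.13-0", "ENSMUST00000000003.13-0",
                    "ENSMUST00000000003.13-0"],
    "name": ["NonCodingDeletion", "NonCodingInsertion", "CodingDeletion",
             "CodingInsertion", "NonCodingDeletion"],
    "value_mRNA": [21.0, 26.0, 1.0, 1.0, 2.0],
})

# Hypothetical helper column: label each row as coding or noncoding
df["kind"] = df["name"].str.startswith("Coding").map(
    {True: "coding", False: "noncoding"})

# Group by a list of columns and compute two aggregations at once
result = df.groupby(["AlignmentId", "kind"]).agg(
    total_mRNA=("value_mRNA", "sum"),
    n_rows=("name", "count"),
)
print(result)
```

The result is indexed by both grouping columns; calling reset_index() on it turns them back into ordinary columns.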
You need to change .name to ['name'], because .name returns the name of the group (the value of the column you are grouping by), not the name column:
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object
A quick check shows the difference between s.name and s['name']:
def aggfunc(s):
    print ('Name of group is: {}'.format((s.name)))
    print ('Column name is:\n {}'.format(s['name']))

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
2 CodingDeletion
3 CodingInsertion
4 NonCodingDeletion
Name: name, dtype: object
Improved code:
def aggfunc(s):
    # the if and else branches built the same set, so the test is omitted
    c = set(s['name'])
    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c,
                      'CodingInsertion' in c, 'CodingDeletion' in c,
                      'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
                          col1   col2   col3   col4
AlignmentId
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
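The question asks for a 5-column DataFrame. One way to get there, sketched under a few assumptions: reset_index() moves AlignmentId out of the index into a regular column. The descriptive flag names below are hypothetical stand-ins for col1 to col4, and selecting [['name']] before apply() is an addition of this sketch so that only the needed column is passed to the function, which should also avoid relying on how newer pandas versions treat the grouping column inside apply().

```python
import pandas as pd

df = pd.DataFrame({
    "AlignmentId": ["ENSMUST00000000001.4-1", "ENSMUST00000000001.4-1",
                    "ENSMUST00000000003.13-0", "ENSMUST00000000003.13-0",
                    "ENSMUST00000000003.13-0"],
    "name": ["NonCodingDeletion", "NonCodingInsertion", "CodingDeletion",
             "CodingInsertion", "NonCodingDeletion"],
})

def aggfunc(group):
    # All classifier names present in this alignment's rows
    c = set(group["name"])
    # Hypothetical column names; a Series makes each flag its own column
    return pd.Series({
        "coding_indel": "CodingDeletion" in c or "CodingInsertion" in c,
        "coding_insertion": "CodingInsertion" in c,
        "coding_deletion": "CodingDeletion" in c,
        "coding_mult3_indel": "CodingMult3Deletion" in c
                              or "CodingMult3Insertion" in c,
    })

# Pass only the 'name' column to the function, then turn the
# AlignmentId index into a regular column: 1 id + 4 flags = 5 columns.
merged = df.groupby("AlignmentId")[["name"]].apply(aggfunc).reset_index()
print(merged)
```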