groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.
What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.
You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.
Do it in 2 steps. First, create a dictionary.
>>> input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')]
>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for v, k in input: res[k].append(v)
...
Then, convert that dictionary into the expected format.
>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]
It is also possible with itertools.groupby but it requires the input to be sorted first.
>>> sorted_input = sorted(input, key=itemgetter(1))
>>> groups = groupby(sorted_input, key=itemgetter(1))
>>> [{'type':k, 'items':[x[0] for x in v]} for k, v in groups]
[{'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}]
Note both of these do not respect the original order of the keys. You need an OrderedDict if you need to keep the order.
>>> from collections import OrderedDict
>>> res = OrderedDict()
>>> for v, k in input:
... if k in res: res[k].append(v)
... else: res[k] = [v]
...
>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}]
Python's built-in itertools
module actually has a groupby
function , but for that the elements to be grouped must first be sorted such that the elements to be grouped are contiguous in the list:
from operator import itemgetter
sortkeyfn = itemgetter(1)
input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'),
('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')]
input.sort(key=sortkeyfn)
Now input looks like:
[('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'),
('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH'), ('11013331', 'KAT'),
('9843236', 'KAT'), ('9085267', 'NOT'), ('11788544', 'NOT')]
groupby
returns a sequence of 2-tuples, of the form (key, values_iterator)
. What we want is to turn this into a list of dicts where the 'type' is the key, and 'items' is a list of the 0'th elements of the tuples returned by the values_iterator. Like this:
from itertools import groupby
result = []
for key,valuesiter in groupby(input, key=sortkeyfn):
result.append(dict(type=key, items=list(v[0] for v in valuesiter)))
Now result
contains your desired dict, as stated in your question.
You might consider, though, just making a single dict out of this, keyed by type, and each value containing the list of values. In your current form, to find the values for a particular type, you'll have to iterate over the list to find the dict containing the matching 'type' key, and then get the 'items' element from it. If you use a single dict instead of a list of 1-item dicts, you can find the items for a particular type with a single keyed lookup into the master dict. Using groupby
, this would look like:
result = {}
for key,valuesiter in groupby(input, key=sortkeyfn):
result[key] = list(v[0] for v in valuesiter)
result
now contains this dict (this is similar to the intermediate res
defaultdict in @KennyTM's answer):
{'NOT': ['9085267', '11788544'],
'ETH': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'],
'KAT': ['11013331', '9843236']}
(If you want to reduce this to a one-liner, you can:
result = dict((key,list(v[0] for v in valuesiter)
for key,valuesiter in groupby(input, key=sortkeyfn))
or using the newfangled dict-comprehension form:
result = {key:list(v[0] for v in valuesiter)
for key,valuesiter in groupby(input, key=sortkeyfn)}
This answer is similar to @PaulMcG's answer but doesn't require sorting the input.
For those into functional programming, groupBy
can be written in one line (not including imports!), and unlike itertools.groupby
it doesn't require the input to be sorted:
from functools import reduce # import needed for python3; builtin in python2
from collections import defaultdict
def groupBy(key, seq):
return reduce(lambda grp, val: grp[key(val)].append(val) or grp, seq, defaultdict(list))
(The reason for ... or grp
in the lambda
is that for this reduce()
to work, the lambda
needs to return its first argument; because list.append()
always returns None
the or
will always return grp
. I.e. it's a hack to get around python's restriction that a lambda can only evaluate a single expression.)
This returns a dict whose keys are found by evaluating the given function and whose values are a list of the original items in the original order. For the OP's example, calling this as groupBy(lambda pair: pair[1], input)
will return this dict:
{'KAT': [('11013331', 'KAT'), ('9843236', 'KAT')],
'NOT': [('9085267', 'NOT'), ('11788544', 'NOT')],
'ETH': [('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH')]}
And as per @PaulMcG's answer the OP's requested format can be found by wrapping that in a list comprehension. So this will do it:
result = {key: [pair[0] for pair in values],
for key, values in groupBy(lambda pair: pair[1], input).items()}
I also liked pandas simple grouping. it's powerful, simple and most adequate for large data set
result = pandas.DataFrame(input).groupby(1).groups
The following function will quickly (no sorting required) group tuples of any length by a key having any index:
# given a sequence of tuples like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping tuples by idx-th element - with idx=1 we have:
# if merge is True {'c':(3,6,88,4), 'a':(7,2,45,0)}
# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by(seqs,idx=0,merge=True):
d = dict()
for seq in seqs:
k = seq[idx]
v = d.get(k,tuple()) + (seq[:idx]+seq[idx+1:] if merge else (seq[:idx]+seq[idx+1:],))
d.update({k:v})
return d
In the case of your question, the index of key you want to group by is 1, therefore:
group_by(input,1)
gives
{'ETH': ('5238761','5349618','962142','7795297','7341464','5594916','1550003'),
'KAT': ('11013331', '9843236'),
'NOT': ('9085267', '11788544')}
which is not exactly the output you asked for, but might as well suit your needs.
result = []
# Make a set of your "types":
input_set = set([tpl[1] for tpl in input])
>>> set(['ETH', 'KAT', 'NOT'])
# Iterate over the input_set
for type_ in input_set:
# a dict to gather things:
D = {}
# filter all tuples from your input with the same type as type_
tuples = filter(lambda tpl: tpl[1] == type_, input)
# write them in the D:
D["type"] = type_
D["itmes"] = [tpl[0] for tpl in tuples]
# append D to results:
result.append(D)
result
>>> [{'itmes': ['9085267', '11788544'], 'type': 'NOT'}, {'itmes': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'itmes': ['11013331', '9843236'], 'type': 'KAT'}]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With