I often want to bucket an unordered collection in Python. itertools.groupby
does the right sort of thing, but it almost always requires massaging: sorting the items first and copying each group iterator out before it's consumed.
Is there any quick way to get this behavior, either through a standard python module or a simple python idiom?
>>> bucket('thequickbrownfoxjumpsoverthelazydog', lambda x: x in 'aeiou')
{False: ['t', 'h', 'q', 'c', 'k', 'b', 'r', 'w', 'n', 'f', 'x', 'j', 'm', 'p',
's', 'v', 'r', 't', 'h', 'l', 'z', 'y', 'd', 'g'],
True: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']}
>>> bucket(range(21), lambda x: x % 10)
{0: [0, 10, 20],
1: [1, 11],
2: [2, 12],
3: [3, 13],
4: [4, 14],
5: [5, 15],
6: [6, 16],
7: [7, 17],
8: [8, 18],
9: [9, 19]}
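For reference, here's roughly what I mean by "massaging" (bucket_with_groupby is just an illustrative name, not a real library function):

```python
# A sketch of the sorted-then-grouped version I'm trying to avoid:
# groupby only merges *consecutive* equal keys, so the items must be
# sorted by the key first, and each group iterator must be copied into
# a list before the next group is pulled.
from itertools import groupby

def bucket_with_groupby(seq, key):
    return {k: list(g) for k, g in groupby(sorted(seq, key=key), key=key)}
```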
This has come up several times before -- (1), (2), (3) -- and there's a partition recipe in the itertools recipes, but to my knowledge there's nothing in the standard library... although I was surprised a few weeks ago by accumulate, so who knows what's lurking there these days? :^)
When I need this behaviour, I use
from collections import defaultdict
def partition(seq, key):
    d = defaultdict(list)
    for x in seq:
        d[key(x)].append(x)
    return d
and get on with my day.
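For example, on the second case from the question (repeating the definition so the snippet runs on its own):

```python
from collections import defaultdict

def partition(seq, key):
    d = defaultdict(list)
    for x in seq:
        d[key(x)].append(x)
    return d

# Bucket 0..20 by last digit.
d = partition(range(21), lambda x: x % 10)
print(d[0])   # [0, 10, 20]
print(d[9])   # [9, 19]
```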
Here is a simple two-liner:
d = {}
for x in "thequickbrownfoxjumpsoverthelazydog": d.setdefault(x in 'aeiou', []).append(x)
Edit: adding your other case for completeness.
d = {}
for x in range(21): d.setdefault(x % 10, []).append(x)
Here's a variant of partition() from above for when the predicate is boolean, avoiding the cost of a dict/defaultdict:
def boolpartition(seq, pred):
    passing, failing = [], []
    for item in seq:
        (passing if pred(item) else failing).append(item)
    return passing, failing
Example usage:
>>> even, odd = boolpartition([1, 2, 3, 4, 5], lambda x: x % 2 == 0)
>>> even
[2, 4]
>>> odd
[1, 3, 5]
If it's a pandas.DataFrame, the following also works, using pd.cut():
from sklearn import datasets
import pandas as pd
# import some data to play with
iris = datasets.load_iris()
df_data = pd.DataFrame(iris.data[:,0]) # we'll just take the first feature
# bucketize
n_bins = 5
feature_name = iris.feature_names[0].replace(" ", "_")
my_labels = [feature_name + "_" + str(num) for num in range(n_bins)]
pd.cut(df_data[0], bins=n_bins, labels=my_labels)
yielding
0 0_1
1 0_0
2 0_0
[...]
If you don't set the labels, the output will look like this:
0 (5.02, 5.74]
1 (4.296, 5.02]
2 (4.296, 5.02]
[...]
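Once pd.cut() has assigned a bin to each value, the buckets themselves can be collected with groupby(). A minimal sketch on a small made-up Series (not the iris data), so it runs standalone:

```python
import pandas as pd

s = pd.Series([1, 7, 5, 4, 6, 3])

# Assign each value to one of two equal-width bins.
bins = pd.cut(s, bins=2, labels=["low", "high"])

# Collect the original values per bin label.
buckets = {label: grp.tolist() for label, grp in s.groupby(bins, observed=False)}
```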