Python Pandas Choosing Random Sample of Groups from Groupby

Tags: python, random, pandas, group-by

What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.

The standard way I would do this for an iterable, if I wanted to select N = 200 elements, is:

rand = random.sample(data, N)

If you attempt the above where data is a 'grouped' object, the elements of the resulting list are tuples for some reason.
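On the tuples: iterating over a pandas GroupBy yields (key, sub-DataFrame) pairs, so a sample drawn from it is a list of such pairs. A minimal sketch, assuming a small made-up frame (the data and the variable names `pairs` and `sampled` are illustrative only):

```python
import random

import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"some_key": [0, 1, 2, 0, 1, 2], "val": [1, 2, 3, 4, 5, 6]})
grouped = df.groupby("some_key")

# Iterating a GroupBy yields (key, sub-DataFrame) pairs, so materialising it
# as a list gives a list of 2-tuples that random.sample can draw from.
pairs = list(grouped)
sampled = random.sample(pairs, 2)

# Unpack each pair to get at the group's key and its rows.
for key, group in sampled:
    print(key, group["val"].tolist())
```

Unpacking each tuple, as in the loop, gives back the group key and the rows that belong to it.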

I found the example below for randomly selecting the elements of a single-key groupby; however, it does not work with a multi-key groupby. It is from "How to access pandas groupby dataframe by key":

# create groupby object
grouped = df.groupby('some_key')
# pick N group keys (on Python 3, random.sample needs a sequence, not a dict)
sampled_df_i = random.sample(list(grouped.indices), N)
# grab the groups using the groupby object's 'get_group' method
df_list = [grouped.get_group(df_i) for df_i in sampled_df_i]
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')


asked Sep 01 '15 20:09

sfortney

2 Answers

I feel like lower-level numpy operations are cleaner:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
        "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
    }
)

ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids

# > array([3, 2])

df.loc[df["some_key"].isin(ids)]

# >     some_key  val
# 2          2    3
# 3          3    4
# 6          2    1
# 7          3    5
# 10         2    7
# 11         3    8
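If the sample needs to be reproducible, the same idea can be wrapped in a small helper that takes a seed. This is only a sketch: `sample_groups` and its parameters are hypothetical names, and it uses NumPy's `default_rng` generator rather than the global `np.random.choice` shown above.

```python
import numpy as np
import pandas as pd


def sample_groups(df, key, n, seed=None):
    """Return the rows of `df` that belong to `n` randomly chosen values of `key`.

    Hypothetical helper for illustration; not part of pandas.
    """
    rng = np.random.default_rng(seed)
    # Draw n distinct key values, then keep every row carrying one of them.
    ids = rng.choice(df[key].unique(), size=n, replace=False)
    return df[df[key].isin(ids)]


df = pd.DataFrame(
    {
        "some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
        "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
    }
)

sampled = sample_groups(df, "some_key", n=2, seed=0)
print(sampled)
```

Passing the same `seed` returns the same groups on each call; leaving it as `None` gives a fresh draw every time.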


answered Oct 07 '22 00:10

jsta

You can take a random sample of the unique values from df.some_key.unique(), use that to slice the df, and finally group by on the result:

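In [337]:

import random
import pandas as pd

df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
                   'val':      [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:

# random.sample needs a sequence on Python 3, so convert the array to a list
keys = random.sample(list(df.some_key.unique()), 2)
print(df[df.some_key.isin(keys)].groupby('some_key').mean())
               val
some_key          
0         1.000000
2         3.666667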

If there is more than one groupby key:

In [358]:

df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
                   'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
                   'val':      [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:

gby = df.groupby(['some_key1', 'some_key2'])
In [360]:

print(gby.mean().loc[random.sample(list(gby.indices), 2)])
                     val
some_key1 some_key2     
1         1          5.0
3         2          8.0

But if you just want the values of each group, you don't even need to groupby; a MultiIndex will do:

In [372]:

indexed = df.set_index(['some_key1', 'some_key2'])
# sample from the key pairs that actually occur in the data
idx = random.sample(indexed.index.unique().tolist(), 2)
print(indexed.loc[idx])
                     val
some_key1 some_key2     
2         0            3
3         1            5

answered Oct 07 '22 00:10

CT Zhu
