stratified sampling in numpy

Tags:

python

numpy

In numpy I have a dataset like the one below. The first two columns are indices. I can divide the dataset into blocks via these indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2 and so on. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset along these blocks 80%/20% at random, such that after the split each block has at least one element in both datasets. How could I do that?

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       } 
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1                       
1   2
1   2
2   0
2   0 
2   1
2   1
2   1
...

asked Apr 05 '13 by siamii

People also ask

What is stratified sampling in Python?

Stratified sampling is a sampling technique used to obtain samples that best represent the population. It reduces selection bias by dividing the population into homogeneous subgroups, called strata, and randomly sampling data from each stratum (the singular of strata).
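
For example, a minimal sketch using scikit-learn's train_test_split (the array sizes, class labels and 80/20 split below are arbitrary illustrations, not part of the original question):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)            # features
y = np.repeat([0, 1, 2, 3], 25)       # class labels used as strata

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # [20 20 20 20] [5 5 5 5]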

How do you do stratified sampling in PySpark?

You can do stratified sampling in PySpark without replacement by using the sampleBy() method, which takes a sampling fraction for each stratum. The fractions argument is a dictionary mapping each stratum value to its sampling fraction; if a stratum is not listed, its fraction defaults to zero.
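
For instance, a minimal sketch (the column name stratum and the fractions below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i % 3, float(i)) for i in range(300)],
                           ["stratum", "value"])

# Sample roughly 20% of stratum 0 and 50% of stratum 1 without replacement;
# stratum 2 is not listed, so its fraction defaults to 0
sample = df.sampleBy("stratum", fractions={0: 0.2, 1: 0.5}, seed=42)
sample.groupBy("stratum").count().show()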

What is stratified sampling in machine learning?

Stratified Sampling is a sampling method that reduces the sampling error in cases where the population can be partitioned into subgroups. We perform Stratified Sampling by dividing the population into homogeneous subgroups, called strata, and then applying Simple Random Sampling within each subgroup.
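
A rough pandas sketch of that idea (the stratum column and the 20% fraction are arbitrary choices here): group by the stratum column and draw a simple random sample within each group.

import pandas as pd
import numpy as np

# Toy population: three strata of different sizes
df = pd.DataFrame({
    "stratum": np.repeat(["a", "b", "c"], [50, 30, 20]),
    "value": np.random.rand(100),
})

# Simple random sampling within each stratum (20% of each group)
sample = df.groupby("stratum").sample(frac=0.2, random_state=0)
print(sample["stratum"].value_counts())  # roughly 10, 6, 4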

What is the Numpy array in Python?

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
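
A small example illustrating rank and shape:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D grid of integers
print(a.ndim)    # 2 -> the rank (number of dimensions)
print(a.shape)   # (2, 3) -> size along each dimension
print(a.dtype)   # all elements share a single type, e.g. int64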


1 Answer

See how you like this. To introduce randomness, I shuffle the entire dataset; it is the only way I have figured out how to do the splitting in a vectorized way. You could probably shuffle an indexing array instead, but that was one indirection too many for my brain today. I have also used a structured array, for ease of extracting the blocks. First, let's create a sample dataset:

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

# First 2*c1*c2 rows: every (idx1, idx2) block appears exactly twice;
# the remaining rows are drawn from randomly chosen blocks
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3

And now the stratified sampling:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

# Label each row with the block it belongs to
blocks, block_idx = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(block_idx)   # number of rows in each block
where = np.argsort(block_idx)          # row positions, sorted by block
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

# Blocks of only 2 items give a divide by zero (inf) here; the clip below
# and the forced first/second assignments make the result correct anyway
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1)  # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1      # first item of each block goes to A
threshold[block_start + 1] = 0  # second item of each block goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]

After running it, the split is roughly 80/20, and all blocks are represented in both arrays:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
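
If you want a more explicit check, a small sketch along the same lines (reusing A and B from above) counts how many rows each block contributes to both splits:

# Count how many rows each block contributes to A and B; every entry should
# be at least 1, and the overall ratio should be close to 4 (i.e. 80/20)
counts_A = np.unique(A[['idx1', 'idx2']], return_counts=True)[1]
counts_B = np.unique(B[['idx1', 'idx2']], return_counts=True)[1]
print(counts_A.min(), counts_B.min())   # -> 1 or more for both
print(len(A) / len(B))                  # -> roughly 4
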
answered Oct 15 '22 by Jaime