Algorithims for Selecting Unique Combinations

Question

I'm looking for algorithims or methods that would be useful for solving this problem:

Say I have a dataframe:

       x         y    Bin1  Bin2      Bin3 
153.0303 -27.17894      10         6         5        
153.0303 -27.17916       8         7         8        
153.0303 -27.17938       1         6         3        
153.0300 -27.17960      10         1         8

That goes on for ~10k rows. Each Bin can be an integer 1 to 10. What I'm trying to do is select a random subset where each Bin only has unique values. E.g this dataframe would be valid, as each Bin has 10 distinct values.

       x         y    Bin1  Bin2      Bin3 
153.0303 -27.17894       1         6         4        
153.0303 -27.17916       2         7         2        
153.0303 -27.17938       3         5         3        
153.0300 -27.17960       4         3         8    
153.0303 -27.17938       5         4         1        
153.0300 -27.17960       6         8         7  
153.0303 -27.17938       7         1         6        
153.0300 -27.17960       8         2        10  
153.0303 -27.17938       9         10        5        
153.0300 -27.17960      10         9         9

My current method involves randomly selecting rows repeatedly until I find a combination. However I'm trying to find a more efficient method.

Thanks in advance!

Dave · Accepted Answer

Idea: We keep track of two indices: current, and last. Anything before current is part of the set we're building, and anything after last has been eliminated as incompatible with something in our set.

We also keep a set of invalid values for each bin.

initialize with current = 1 and last = n, the number of rows.

swap row[current] with a random row between current & last inclusive.
If the swapped row is valid (check vs your sets) then update your sets, increment current, and go back to 1 or stop if you've hit your target.
If the swapped row isn't valid then swap it with the last row, decrement last, and go back to 1.

In the worst case, you've asked for something impossible. I.e., you want k valid rows but the largest possible set of valid rows is < k. In this case, you will run through this loop n times.

It's also possible that you ask for k valid rows, and they exist, but the k-1 or fewer rows you've already selected are incompatible with any row additions.

So whether this approach is reasonable depends on your data.

Algorithims for Selecting Unique Combinations

Tags:

python

algorithm

dataframe

Loleman

1 Answers

Dave

Recent Activity

Donate For Us

Algorithims for Selecting Unique Combinations

Tags:

python

algorithm

dataframe

Loleman

1 Answers

Dave

Related questions

Recent Activity

Donate For Us