Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithims for Selecting Unique Combinations

I'm looking for algorithims or methods that would be useful for solving this problem:

Say I have a dataframe:

       x         y    Bin1  Bin2      Bin3 
153.0303 -27.17894      10         6         5        
153.0303 -27.17916       8         7         8        
153.0303 -27.17938       1         6         3        
153.0300 -27.17960      10         1         8     

That goes on for ~10k rows. Each Bin can be an integer 1 to 10. What I'm trying to do is select a random subset where each Bin only has unique values. E.g this dataframe would be valid, as each Bin has 10 distinct values.

       x         y    Bin1  Bin2      Bin3 
153.0303 -27.17894       1         6         4        
153.0303 -27.17916       2         7         2        
153.0303 -27.17938       3         5         3        
153.0300 -27.17960       4         3         8    
153.0303 -27.17938       5         4         1        
153.0300 -27.17960       6         8         7  
153.0303 -27.17938       7         1         6        
153.0300 -27.17960       8         2        10  
153.0303 -27.17938       9         10        5        
153.0300 -27.17960      10         9         9   

My current method involves randomly selecting rows repeatedly until I find a combination. However I'm trying to find a more efficient method.

Thanks in advance!

like image 921
Loleman Avatar asked Mar 02 '26 17:03

Loleman


1 Answers

Idea: We keep track of two indices: current, and last. Anything before current is part of the set we're building, and anything after last has been eliminated as incompatible with something in our set.

We also keep a set of invalid values for each bin.

initialize with current = 1 and last = n, the number of rows.

  1. swap row[current] with a random row between current & last inclusive.
  2. If the swapped row is valid (check vs your sets) then update your sets, increment current, and go back to 1 or stop if you've hit your target.
  3. If the swapped row isn't valid then swap it with the last row, decrement last, and go back to 1.

In the worst case, you've asked for something impossible. I.e., you want k valid rows but the largest possible set of valid rows is < k. In this case, you will run through this loop n times.

It's also possible that you ask for k valid rows, and they exist, but the k-1 or fewer rows you've already selected are incompatible with any row additions.

So whether this approach is reasonable depends on your data.

like image 140
Dave Avatar answered Mar 05 '26 07:03

Dave



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!