I'm looking for algorithims or methods that would be useful for solving this problem:
Say I have a dataframe:
x y Bin1 Bin2 Bin3
153.0303 -27.17894 10 6 5
153.0303 -27.17916 8 7 8
153.0303 -27.17938 1 6 3
153.0300 -27.17960 10 1 8
That goes on for ~10k rows. Each Bin can be an integer 1 to 10. What I'm trying to do is select a random subset where each Bin only has unique values. E.g this dataframe would be valid, as each Bin has 10 distinct values.
x y Bin1 Bin2 Bin3
153.0303 -27.17894 1 6 4
153.0303 -27.17916 2 7 2
153.0303 -27.17938 3 5 3
153.0300 -27.17960 4 3 8
153.0303 -27.17938 5 4 1
153.0300 -27.17960 6 8 7
153.0303 -27.17938 7 1 6
153.0300 -27.17960 8 2 10
153.0303 -27.17938 9 10 5
153.0300 -27.17960 10 9 9
My current method involves randomly selecting rows repeatedly until I find a combination. However I'm trying to find a more efficient method.
Thanks in advance!
Idea: We keep track of two indices: current, and last. Anything before current is part of the set we're building, and anything after last has been eliminated as incompatible with something in our set.
We also keep a set of invalid values for each bin.
initialize with current = 1 and last = n, the number of rows.
In the worst case, you've asked for something impossible. I.e., you want k valid rows but the largest possible set of valid rows is < k. In this case, you will run through this loop n times.
It's also possible that you ask for k valid rows, and they exist, but the k-1 or fewer rows you've already selected are incompatible with any row additions.
So whether this approach is reasonable depends on your data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With