
Randomly split up elements from a stream of data without knowing the total number of elements

Given a "split ratio", I am trying to randomly split up a dataset into two groups. The catch is, that I do not know beforehand how many items the dataset contains. My library receives the data one by one from an input stream and is expected to return the data to two output streams. The resulting two datasets should ideally be exactly split into the given split ratio.

Illustration:

                            ┌─► stream A
 input stream ──► LIBRARY ──┤
                            └─► stream B

For example, given a split ratio of 30/70, stream A would be expected to receive 30% of the elements from the input stream and stream B the remaining 70%. Within each stream, the order of the elements must be preserved.


My ideas so far:

Idea 1: "Roll the dice" for each element

The obvious approach: for each element, the algorithm randomly decides whether the element goes into stream A or B. The problem is that the resulting datasets might be far off from the expected split ratio. Given a split ratio of 50/50, the actual split could deviate considerably (it could even be 100/0 for very small datasets). The goal is to keep the resulting split ratio as close as possible to the desired one.
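For illustration, a minimal sketch of this idea (the output-stream objects and their push method are placeholders):

// Naive per-element decision: with splitRatio = 0.3, each element has
// a 30% chance of going to stream A, independently of all others.
function rollTheDice(element, splitRatio, streamA, streamB) {
  if (Math.random() < splitRatio) {
    streamA.push(element);
  } else {
    streamB.push(element);
  }
}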

Idea 2: Use a cache and randomize the cached data

Another idea is to cache a fixed number of elements before passing them on: for example, cache 1000 elements, shuffle them (or their corresponding indices, to keep the output order stable), split them according to the ratio, and pass the resulting groups on. This should work very well, but I am unsure whether the randomization is really random for large datasets (I imagine patterns would show up when looking at the distribution).
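A sketch of this caching idea (the emit callback and the block handling are illustrative):

// Buffer a block of elements, randomly pick a splitRatio share of the
// indices for stream A, then emit everything in its original order.
function splitBlock(block, splitRatio, emit) {
  const indices = block.map((_, i) => i);
  // Fisher-Yates shuffle of the indices (not the elements themselves)
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  const countA = Math.round(block.length * splitRatio);
  const toA = new Set(indices.slice(0, countA));
  // Emitting in input order keeps the relative order within each stream
  block.forEach((element, i) => emit(element, toA.has(i) ? 'A' : 'B'));
}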

Neither algorithm is optimal, so I hope you can help me.


Background

This is about a layer-based data science tool, where each layer receives data from the previous layer via a stream. This layer is expected to split the data (vectors) up into a training and test set before passing them on. The input data can range from just a few elements to a never ending stream of data (hence, the streams). The code is developed in JavaScript, but this question is more about the algorithm than the actual implementation.

Asked Aug 13 '19 by Thomas Dondorf


4 Answers

You could adjust the probability as it shifts away from the desired rate.

Here's an example along with tests for various strengths of the adjustment. As we increase the adjustment, the stream splitter deviates less from the ideal ratio, but it also becomes less random (knowing the previous values, you can predict the next ones).

// rateStrictness = 0 will lead to "rolling the dice" for each invocation
// higher values of rateStrictness will lead to stronger "correcting" forces
function* splitter(desiredARate, rateStrictness = .5) {
	let aCount = 0, bCount = 0;

	while (true) {
		// guard against 0/0 on the very first element
		let total = aCount + bCount;
		let actualARate = total === 0 ? desiredARate : aCount / total;
		let aRate = desiredARate + (desiredARate - actualARate) * rateStrictness;
		if (Math.random() < aRate) {
			aCount++;
			yield 'a';
		} else {
			bCount++;
			yield 'b';
		}
	}
}

let test = (desiredARate, rateStrictness) => {
	let s = splitter(desiredARate, rateStrictness);
	let values = [...Array(1000)].map(() => s.next().value);
	// running count of 'a' values up to each index (the boolean coerces to 0/1)
	let aCount = values.map((_, i) => values.reduce((count, v, j) => count + (v === 'a' && j <= i), 0));
	let aRate = aCount.map((c, i) => c / (i + 1));
	let deviation = aRate.map(a => a - desiredARate);
	let avgDeviation = deviation.reduce((sum, dev) => sum + dev, 0) / deviation.length;
	console.log(`inputs: desiredARate = ${desiredARate}; rateStrictness = ${rateStrictness}; average deviation = ${avgDeviation}`);
};

test(.5, 0);
test(.5, .25);
test(.5, .5);
test(.5, .75);
test(.5, 1);
test(.5, 10);
test(.5, 100);
Answered Nov 09 '22 by junvar


How about rolling the dice twice: first decide whether the stream should be chosen randomly or whether the ratio should be taken into account. Then, in the first case, roll the dice; in the second case, take the ratio. Some pseudocode:

  const toA =
    Math.random() > 0.5 // 0 -> totally random, 1 -> totally ratio-driven
      ? Math.random() > 0.7                      // random: A receives 30% on average
      : (numberA / (numberA + numberB) < 0.3);   // corrective: pick A if it is under-represented

That's just an idea I came up with; I haven't tried it yet.
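For reference, a runnable variant of that pseudocode might look like this (the 30% target and the 0.5 mixing factor are the same illustrative constants):

// Returns 'A' or 'B' for the next element, given the counts sent so far.
// mix = 1 -> always random, mix = 0 -> always corrective.
function chooseStream(numberA, numberB, aRate = 0.3, mix = 0.5) {
  const toA = Math.random() < mix
    ? Math.random() < aRate                        // roll the dice
    : numberA / (numberA + numberB || 1) < aRate;  // correct toward the ratio
  return toA ? 'A' : 'B';
}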

Answered Nov 10 '22 by Jonas Wilms


Here is a way that combines both of your ideas: it uses a cache. As long as the cache could still absorb the imbalance if the stream ended right now (so the target distribution remains reachable), we just roll a die. If not, we add the element to the cache. When the input stream ends, we shuffle the elements in the cache and emit them so as to approach the target distribution. I am not sure whether, in terms of randomness, this gains anything over simply forcing an element into a given stream when the distribution strays too far off.

Beware that this approach does not preserve the order of the original input stream. A few other things could be added, such as a cache limit and a relaxed distribution error (it is 0 here). If you need to preserve order, you can, whenever the cache is non-empty, emit a cached value and push the current one onto the cache instead of emitting the current one directly.

// Fisher-Yates shuffle (sorting by a random comparator is biased)
let shuffle = (array) => {
  for (let i = array.length - 1; i > 0; i--) {
    let j = Math.floor(Math.random() * (i + 1));
    [array[i], array[j]] = [array[j], array[i]];
  }
  return array;
};

function* generator(numElements) {
  for (let i = 0; i < numElements; i++) yield i;
}

function* splitter(aGroupRate, generator) {
  let cache = [];
  let sentToA = 0;
  let sentToB = 0;
  let bGroupRate = 1 - aGroupRate;
  let maxCacheSize = 0;
  
  let sendValue = (value, group) => {
      sentToA += group == 0;
      sentToB += group == 1;
      return {value: value, group: group};
  }
  
  function* retRandomGroup(value, expected) {
    // roll the dice until it picks the expected group; each disagreeing
    // roll sends a cached value to the opposite group instead
    while ((Math.random() > aGroupRate) != expected) {
      if (cache.length) {
        yield sendValue(cache.pop(), !expected);
      } else {
        yield sendValue(value, !expected);
        return;
      } 
    }
    yield sendValue(value, expected);
  }
  
  for (let value of generator) {
    if (sentToA + sentToB == 0) {
      yield sendValue(value, Math.random() > aGroupRate);
      continue;
    }
    
    let currentRateA = sentToA / (sentToA + sentToB);
        
    if (currentRateA <= aGroupRate) {
      // can we handle current value going to b group?
      if ((sentToA + cache.length) / (sentToB + sentToA + 1 + cache.length) >= aGroupRate) {
        for (let val of retRandomGroup(value, 1)) yield val;
        continue;
      }
    }
    
    if (currentRateA > aGroupRate) {
      // can we handle current value going to a group?
      if (sentToA / (sentToB + sentToA + 1 + cache.length) <= aGroupRate) {
        for (let val of retRandomGroup(value, 0)) yield val;
        continue;
      }
    }  
    
    cache.push(value);
    maxCacheSize = Math.max(maxCacheSize, cache.length)
  }
  
  shuffle(cache);
  
  let totalElements = sentToA + sentToB + cache.length;
  
  while (sentToA < totalElements * aGroupRate) {
    yield {value: cache.pop(), group: 0}
    sentToA += 1;
  }
  
  while (cache.length) {
    yield {value: cache.pop(), group: 1}
  }  
  
  yield {cache: maxCacheSize}
}

function test(numElements, aGroupRate) {
  let gen = generator(numElements);
  let sentToA = 0;
  let total = 0;
  let cacheSize = null;
  let split = splitter(aGroupRate, gen);
  for (let val of split) {
    if (val.cache != null) cacheSize = val.cache;
    else {
      sentToA += val.group == 0;
      total += 1
    }
  }
  console.log("required rate for A group", aGroupRate, "actual rate", sentToA / total, "cache size used", cacheSize);
}

test(3000, 0.3)
test(5000, 0.5)
test(7000, 0.7)
Answered Nov 10 '22 by juvian


Let's say you have to maintain a given ratio R of data items going to stream A, e.g. R = 0.3 as per your example. Then, on receiving each data item, count the total number of items and the number passed on to stream A, and decide for each item whether sending it to A or to B keeps you closer to your target ratio R.

That should be about the best you can do for any size of dataset. As for randomness, the resulting streams A and B should be about as random as your input stream.

Let's see how this plays out for the first couple of iterations:

Example: R = 0.3

N : total number of items processed so far (initially 0)

A : number of items passed on to stream A so far (initially 0)

First Iteration

N = 0 ; A = 0 ; R = 0.3
if next item goes to stream A then 
    n = N + 1
    a = A + 1
    r = a / n = 1
else if next item goes to stream B
    n = N + 1
    a  = A
    r = a / n = 0

So first item goes to stream B since 0 is closer to 0.3

Second Iteration

N = 1 ; A = 0 ; R = 0.3
if next item goes to stream A then 
    n = N + 1
    a = A + 1
    r = a / n = 0.5
else if next item goes to stream B
    n = N + 1
    a = A
    r = a / n = 0

So second item goes to stream A since 0.5 is closer to 0.3

Third Iteration

N = 2 ; A = 1 ; R = 0.3
if next item goes to stream A then 
    n = N + 1
    a = A + 1
    r = a / n = 0.67
else if next item goes to stream B
    n = N + 1
    a = A
    r = a / n = 0.33

So third item goes to stream B since 0.33 is closer to 0.3

Fourth Iteration

N = 3 ; A = 1 ; R = 0.3
if next item goes to stream A then 
    n = N + 1
    a = A + 1
    r = a / n = 0.5
else if next item goes to stream B
    n = N + 1
    a = A
    r = a / n = 0.25

So fourth item goes to stream B since 0.25 is closer to 0.3

This would be the pseudocode for deciding each data item:

if abs(((A + 1) / (N + 1)) - R) < abs((A / (N + 1)) - R) then
    put the next data item on stream A
    A = A + 1
    N = N + 1
else
    put the next data item on stream B
    N = N + 1

As discussed in the comments below, this is not random in the sense intended by the OP. So once we know the correct target stream for the next item, we flip a coin to decide whether we actually put it there or introduce an error.

if abs(((A + 1) / (N + 1)) - R) < abs((A / (N + 1)) - R) then
    target_stream = A
else
    target_stream = B

if random() < 0.5 then
    if target_stream == A then
        target_stream = B
    else
        target_stream = A

if target_stream == A then
    put the next data item on stream A
    A = A + 1
    N = N + 1
else
    put the next data item on stream B
    N = N + 1

Now that could lead to an arbitrarily large error overall (with an unconditional 50% flip, each item effectively goes to either stream with probability 0.5, regardless of R). So we have to set an error limit L and check how far the resulting ratio would stray from the target R before an error is introduced:

if abs(((A + 1) / (N + 1)) - R) < abs((A / (N + 1)) - R) then
    target_stream = A
else
    target_stream = B

if random() < 0.5 then
    if target_stream == A then
        if abs((A / (N + 1)) - R) < L then
            target_stream = B
    else
        if abs(((A + 1) / (N + 1)) - R) < L then
            target_stream = A

if target_stream == A then
    put the next data item on stream A
    A = A + 1
    N = N + 1
else
    put the next data item on stream B
    N = N + 1

So here we have it: processing data items one by one, we know the correct stream to put the next item on; we then introduce random local errors, and we can limit the overall error with L.
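A direct JavaScript translation of that final pseudocode might look like this (the error limit L and the sample size are illustrative):

// Greedy ratio-keeping with bounded random errors.
// R: target rate for stream A; L: maximum allowed deviation from R.
function makeChooser(R, L) {
  let A = 0; // items sent to stream A so far
  let N = 0; // items processed so far
  return () => {
    // greedy choice: which stream keeps the running rate closest to R?
    let target = Math.abs((A + 1) / (N + 1) - R) < Math.abs(A / (N + 1) - R)
      ? 'A' : 'B';
    // flip a coin and introduce an error, but only within the limit L
    if (Math.random() < 0.5) {
      if (target === 'A' && Math.abs(A / (N + 1) - R) < L) target = 'B';
      else if (target === 'B' && Math.abs((A + 1) / (N + 1) - R) < L) target = 'A';
    }
    if (target === 'A') A++;
    N++;
    return target;
  };
}

// Example: 30/70 split with at most 5% deviation.
const next = makeChooser(0.3, 0.05);
const sample = Array.from({ length: 1000 }, next);
console.log(sample.filter(s => s === 'A').length / sample.length); // close to 0.3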

Answered Nov 10 '22 by Gregor Ophey