Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determinism of Java 8 streams

Motivation

I've just rewritten some 30 mostly trivial parsers and I need that the new versions behave exactly like the old ones. Therefore, I stored their example input files and some signature of the outputs produced by the old parsers for comparison with the new ones. This signature contains the counts of successfully parsed items, sums of some hash codes and up to 10 pseudo-randomly chosen items.

I thought this was a good idea as the equality of the hash code sums sort of guarantee that the outputs are exactly the same and the samples allow me to see what's wrong. I'm only using samples as otherwise it'd get really big.

The problem

Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should work also when something is missing, so ideas like taking the 100th smallest element don't work.

ImmutableList<String> selectSome(Collection<String> list) {
        if (list.isEmpty()) return ImmutableList.of();
        return IntStream.range(1, 20)
            .mapToObj(seed -> selectOne(list, seed))
            .distinct()
            .limit(10)
            .collect(ImmutableList.toImmutableList());
    }

So I start with numbers from 1 to 20 (so that after distinct I still most probably have my 10 samples), call a stateless deterministic function selectOne (defined below) returning one string which is maximal according to some funny criteria, remove duplicates, limit the result and collect it using Guava. All steps should be IMHO deterministic and "ordered", but I may be overlooking something. The other possibility would be that all my 30 new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the results of the parsing look correct.

String selectOne(Collection<String> list, int seed) {
    // some boring mixing, definitely deterministic
    for (int i=0; i<10; ++i) {
        seed *= 123456789;
        seed = Integer.rotateLeft(seed, 16);
    }
    // ensure seed is odd
    seed = 2*seed + 1;

    // first element is the candidate result
    String result = list.iterator().next();
    // the value is the hash code multiplied by the seed
    // overflow is fine
    int value = seed * result.hashCode();

    // looking for s maximizing seed * s.hashCode()
    for (final String s : list) {
        final int v = seed * s.hashCode();
        if (v < value) continue;
        // tiebreaking by taking the bigger or smaller s
        // this is needed for determinism
        if (s.compareTo(result) * seed < 0) continue;
        result = s;
        value = v;
    }
    return result;
}

This sampling doesn't seem to work. I get a sequence like

"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...

with one old parser and

"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...

with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.

Is my usage of streams wrong? Do I have to add .sequential() or alike?

Update

Sorting the input collection has solved the problem:

ImmutableList<String> selectSome(Collection<String> collection) {
    final List<String> list = Lists.newArrayList(collection);
    Collections.sort(list);
    .... as before
}

What's still missing is an explanation why.

The explanation

As stated in the answers, my tiebreaker was an all-breaker as I missed to check for a tie. Something like

if (v==value && s.compareTo(result) < 0) continue;

works fine.

I hope that my confused question may be at least useful for someone looking for "consistent sampling". It wasn't really Java 8 related.

I should've used Guava ComparisonChain or better Java 8 arg max to avoid my stupid mistake:

String selectOne(Collection<String> list, int seed) {
    .... as before
    final int multiplier = 2*seed + 1;
    return list.stream()
          .max(Comparator.comparingInt(s -> multiplier * s.hashCode())
          .thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
          .get();
}
like image 288
maaartinus Avatar asked Aug 25 '17 06:08

maaartinus


2 Answers

The mistake is that your tiebreaker is not in fact breaking a tie. We should be selecting s when v > value, but instead we're falling back to compareTo(). This breaks comparison symmetry, making your algorithm dependent on encounter order.

As a bonus, here's a simple test case to reproduce the bug:

System.out.println(selectOne(Arrays.asList("1", "2"), 4));  // 1
System.out.println(selectOne(Arrays.asList("2", "1"), 4));  // 2
like image 82
shmosel Avatar answered Sep 27 '22 19:09

shmosel


In selectOne you just want to select String s with max rank of value = seed * s.hashCode(); for that given seed.

The problem is with the "tiebreaking" line: if (s.compareTo(result) * seed < 0) continue;

It is not deterministic - for different order of elements it omits different elements from being check, and thus change in order of elements is changing the result.

Remove the tiebreaking if and the result will be insensitive to the order of elements in input list.

like image 45
Krzysztof Cichocki Avatar answered Sep 27 '22 18:09

Krzysztof Cichocki