Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing a ranking with Java 8 Stream API

A common problem when processing data is to build a ranking of items based on some property.

The ranking operation consists in mapping the items to a set of ordinal numbers (called ranking). In presence of ties (i.e. two items having the same ranking) several strategies can be used, in this context we will assume the standard competition ranking (a.k.a. "1224" ranking) is used.

The question I'd like to investigate here is what is the "best" way of doing this using the streaming API. Where best may mean, most elengant, most efficient, simplest, etc.

Case study

Let's start with a very simple example: a stream of words (i.e. Strings) out of which a ranking by decreasing lengths should be created. If we take the first paragraph of this document, i.e.

A common problem when processing data is to build 
a ranking of itemsbased on some property

we would get the following ranking (by inverse lengths) should be obtained:

1 - processing (10)
2 - property (8)
3 - problem (7)
3 - ranking (7)
5 - common (6)
6 - build (5)
6 - items (5)
6 - based (5)
9 - when (4)
9 - data (4)
9 - some (4)
12 - is (2)
12 - to (2)
12 - of (2)
12 - on (2)
16 - A (1)
16 - a (1)

When two items share the same ranking property value, they are assigned the same ranking.

Questions

The main question here is:

what stream API construct do you used to compute the ranking and how you organize your code?

A second related question is

how you represent the result of the ranking computation?

like image 460
Marco Torchiano Avatar asked Apr 03 '17 10:04

Marco Torchiano


2 Answers

The two questions are addressed one after the other

  • ranking representation
  • ranking procedure

Ranking representation

The typical ranking, as shown above, is a list of items preceded by their relative ranking value.

There are, at least, two possible implementations for the return value of a ranking.

Option 1: List<Rank>

A possible internal representation of the ranking could be a list of elements containing both the ranking value and the item. For instance we could define a class:

public class Rank<T> {
  protected int ranking;
  protected T item;
  protected Rank(int r, T i) { ranking=t; item=i; }
  public int getRanking() { return ranking; }
  public T getItem() { return item; }
  static <T> Rank<T> factory(int ranking, T item) { 
      return new Rank<T>(ranking,item);
  }
}

Or, with a somewhat more esotheric code:

public interface Rank<T> {
      int getRanking();
      T getItem();
      static <T> Rank<T> factory(int ranking, T item) { 
          return new Rank<T>(){
               public int getRanking() { return ranking; }
               public T getItem() { return item; }
          };}
}

The resulting ranking will be returned as a List<Rank>. The list must be returned sorted by descending ranking and processed in that order.

A potential disadvantage of such solution is the need for a new ad-hoc class or interface (i.e. Rank).

Option 2: SortedMap<Integer,List<T>>

An alternative solution that uses only predefined standard classes and intefaces is based on a map.

The keys of the map correspond to the ranking values, while the values of the map consist in lists that contain the items sharing that ranking.

The return type of the ranking would be a SortedMap<T,List<T>>. The sorted map will be implicitly sorted and can be traversed according to the natural order of the key.

This latter solution seems preferable because it does not require any new class and can be better understood.


Ranking Procedure

The ranking computation procedure can be implemented in different ways. Here two approaches are examined.

In all cases we need to have a function that serves the purpose of extracting the ranking property:

Function<String,Integer> propertyExtractor = String::length;

Typically, the property type should be comparable, though to be more general (e.g. to sort strings in decreasing order) the property comparator can be defined:

Comparator<Integer> propertyComparator = reverseOrder();

The two approaches are illustrated, referring to the case study described above, using a words stream as a starting point:

String paragraph = "A common problem when processing data is to build  a ranking of items based on some property";
Stream<String> words = Stream.of(paragraph.split(" "));

Sorting by ranking property

Basic procedure

An initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:

SortedMap<Integer,List<String>> ranking = new TreeMap<>();

The procedure is as follows:

  1. Sort the elements of the stream

    words.sorted(comparing(propetyExtractor,propertyComparator))
    
  2. Assign the ranking, starting with 1 and incrementing it any time the property changes. The ranking has to be incremented by the number of items that shared the current ranking.

    .forEach( item -> {
        Integer property = propertyExtractor.apply(item);
        if(ranking.isEmpty()){
            ranking.put(1,new LinkedList<String>());
        }else{
            Integer rank = ranking.lastKey();
            List<String> items = ranking.get(rank);
            if(! property.equals(propertyExtractor.apply(items.get(0)))) {
                ranking.put(rank+items.size(), new LinkedList<String>());
            }
        }
        ranking.get(ranking.lastKey()).add(item);
    });
    

The above code works as follows:

  • The initial sequence of items is unsorted:

    {"DD","BBB","F","CCC","AAAA","EE"}

  • Step 1 sorts the item using the propertyExtractor and the relative propertyComparator, i.e. by string length:

    {"AAAA","BBB","CCC","DD","EE","F"}

  • then for each element (in order) a new entry in the map is added any time the lenght of the new element differs from the previous element lenght

    • When "AAAA" is processed (first item) a new entry is added to the map with key (i.e. ranking) = 1:

      ranking : {1=["AAAA"]}

    • As the second item "BBB" is processed, since its lenght (3) is different from the last element length (4, the lenght of "AAAA") a new entry is added with un updated ranking value:

      ranking : {1=["AAAA"],2=["BBB"]}

    • As the third item "CCC" is processed, since its lenght (3) is equal to the last element length, the item is added to the entry list

      ranking : {1=["AAAA"],2=["BBB","CCC"]}

    • etc.

Usage of collector

A more functional-style version can be defined by using a collector that encapsulates the data structure where the results are accumulated. In practice we can write a single stream expression that returns the ranking map:

SortedMap<Integer,List<String>> ranking = 
words.sorted(comparing(propertyExtractor,propertyComparator))
     .collect(TreeMap::new, // supplier

             (rank, item) -> {  // accumulator
                 Integer property = propertyExtractor.apply(item);
                 if(rank.isEmpty()){
                     rank.put(1,new LinkedList<String>());
                 }else{
                     Integer r = rank.lastKey();
                     List<String> items = rank.get(r);
                     Integer prevProp = propertyExtractor.apply(items.get(0))
                     if(! property.equals(prevProp)) {
                         rank.put(r+items.size(), new LinkedList<String>());
                     }
                 }
                 rank.get(rank.lastKey()).add(item);
             },

             (rank1,rank2) -> { // combiner
                 \\...
             }
     );

Combining results

The combiner method that is left undefiened in the above code deserves some additional reflections.

The functional interface is involved when the collection is performed in parallel; it is used to combine two partial results of the accumulation into a single one. In this case, it has to implement the interface BiConsumer<R,R> and it is supposed to combine the two accumulated rankings - rank1 and rank2 - into the first one.

A possible implementation of the supplier is:

BiConsumer<SortedMap<Integer,List<String>>,
             SortedMap<Integer,List<String>>> combiner =  
(rank1,rank2) -> {
        int lastRanking = rank1.lastKey();
        int offset = lastRanking + rank1.get(lastRanking).size()-1;
        if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
            == propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0)) ){
            rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
            rank2.remove(rank2.firstKey());
        }
        rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}

The collector has no declared properties, so the elements will be processed in order and then composed in order. E.g. the sorted list of items is divided into two parts, then the first part is processed through with one copy of the collector producing as result the map rank1; in parallel, the second part is processed giving as result rank2.

Let us suppose the streams are processed in parallel, in two concurent execution of the collect operations:

  • the initial sorted stream (result of the sorted() operation is divided into two stream, maintaining the orded

    • sub-stream 1: {"AAAA","BBB","CCC"}
    • sub-stream 2: {"DD","EE","F"}
  • two distinct collectors operate in parallel, on each sub-stream the result are two maps contain the partial ranking of each sub-stream:

    • rank1 = {1=["AAAA"],2=["BBB","CCC"]}
    • rank2 = {1=["DD","EE"],3=["F"]}
  • The combiner operation should merge rank2 into rank1, in practice each entry of the rank2 operation should be added to rank1 with its key updated. They keys are updated by adding an offset that is equal to the key of the last entry plus the lenght of the last entry value list minus one:

    int lastRanking = rank1.lastKey();
    int offset = lastRanking + rank1.get(lastRanking).size()-1;
    

    In practice the entry 1=["DD","EE"] of rank2 should be turned into 4=["DD","EE"] and added to rank1.

  • In addition, a special case should be considered, i.e. when the items in the last entry of rank1 and those in the first entry of rank2 share the same ranking property value. E.g. for string length:

    • rank1 = {1=["AAAA"],2=["BBB","CCC"]}
    • rank2 = {1=["DDD"],2=["EE"],3=["F"]}

    When this case occurs, the items in the rank2 first entry list must be added to this in the last rank1 entry list, and then the first entry removed. That is the above maps should be transformed into:

    • rank1 = {1=["AAAA"],2=["BBB","CCC","DDD"]}
    • rank2 = {~~1=["DDD"],~~2=["EE"],3=["F"]}

    Then the entries of rank2 can be updated and added to rank1 as described above.

Grouping by ranking property

Basic procedure

As for the sorted list version, an initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:

SortedMap<Integer,List<String>> ranking = new TreeMap<>();

The naive procedure is as follows:

  1. group the items by their property in a sorted map

    words.collect(groupingBy(propertyExtractor,
                            ()->new TreeMap<>(propertyComparator),toList()))
    
  2. for each entry in the above map compute the ranking

         .forEach(
            (property,items) -> 
                ranking.put(ranking.isEmpty()?1:ranking.lastKey()+
                                    ranking.get(ranking.lastKey()).size(),
                            items )
                 );
    

The above code works as follows:

  • The initial sequence of items is unsorted:

    {"DD","BBB","F","CCC","AAAA","EE"}

  • Step 1 groups the item by their propertyExtractor and store them into a sorted set whose keys are sorted through the propertyComparator, i.e. by string length:

    {4=["AAAA"],3=["BBB","CCC"],2=["DD","EE"],1=["F"]}

  • Step 2, creates for each entry in the intermediate map a new entry in the result map have the ranking value as key and the same value (i.e. the list of items) as the intermediate entry.

    The ranking is computed as follows:

    • 1 for the first entry (i.e. when the result map is still empty), or
    • previous ranking (ranking.lastKey()) plus number of elements sharing the previous ranking (ranking.get(ranking.lastKey()).size()).

    The result is the final ranking map:

    {1=["AAAA"],2=["BBB","CCC"],4=["DD","EE"],6=["F"]}

Using a collector

The above procedure can be rewritten using a collector to avoid side effects of the operations.

Since the first step consists in a collection, we can use the predefined collector factory methdo collectingAndThen to concatenate the first collector with a function that is applied to the collector result; such a function will perform the step 2 described above.

SortedMap<Integer,List<String>> ranking = 
words.collect(
       collectingAndThen(
            groupingBy(propertyExtractor, 
                         ()->new TreeMap<>(propertyComparator),
                         toList()),
            map -> map.entrySet().stream().collect(
                      TreeMap::new,
                      (rank,entry) -> 
                          rank.put(rank.isEmpty()?1:rank.lastKey()+ 
                                        rank.get(rank.lastKey()).size(),
                                 entry.getValue() ),
                      combiner 
                    )
        )
);

Since the structure of the result, i.e. the accumulator object, is the same as the sorted stream solution, here the same combiner can be used.

Generic solutions

The above discussion and solution applied to the special case of a stream of Strings. But the approach can be generalized using generics.

Ranking sorting function

The solution based on the sorted stream can be enclosed in a function:

static <T,V> SortedMap<Integer,List<T>> 
rank(Stream<T> stream, 
     Function<T,V> propertyExtractor, 
     Comparator<V> propertyComparator){
  return
  stream.sorted(comparing(propertyExtractor,propertyComparator))
        .collect(TreeMap::new, 
                 (rank, item) -> {
                    V property = propertyExtractor.apply(item);
                    if(rank.isEmpty()){
                      rank.put(new Integer(1),new LinkedList<T>());
                    }else{
                      Integer r = rank.lastKey();
                      List<T> items = rank.get(r);
                      if(! property.equals(propertyExtractor.apply(items.get(0)))) {
                        rank.put(r+items.size(), new LinkedList<T>());
                      }
                    }
                    rank.get(rank.lastKey()).add(item);
                },
            (rank1,rank2) -> { 
                   int lastRanking = rank1.lastKey();
                   int offset = lastRanking + rank1.get(lastRanking).size()-1;
                   if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
                       == propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
                     rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
                     rank2.remove(rank2.firstKey());
                   }
                   rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
            }
    );
}

The above method can be applied as:

SortedMap<Integer,List<String>> ranking =
            rank(words,String::length,reverseOrder());

Ranking grouping collector

The approach based on the grouping by property value can be encapsulated in a collector:

static <T,V> Collector<T,?,SortedMap<Integer,List<T>>> 
rankingCollector(Function<T,V> propertyExtractor, 
                 Comparator<V> propertyComparator){
  return 
  collectingAndThen(
    groupingBy(propertyExtractor,
               ()->new TreeMap<>(propertyComparator),
               toList()),
    map -> map.entrySet().stream().collect(
            TreeMap::new,
            (rank,entry) -> 
                  rank.put(rank.isEmpty()?1:rank.lastKey()+
                             rank.get(rank.lastKey()).size(),
                           entry.getValue() ),
            (rank1,rank2) -> { 
                   int lastRanking = rank1.lastKey();
                   int offset = lastRanking + rank1.get(lastRanking).size()-1;
                   if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
                       == 

    propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
                         rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
                         rank2.remove(rank2.firstKey());
                       }
                       rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
                }
         )
      );
    }

This collector factory method can be used e.g. as:

SortedMap<Integer,List<String>> ranking =
        words.collect(rankingCollector(String::length,reverseOrder()));

Printing the ranking

Once the ranking has been computed and stored in the map, it can be printed, typically for debug purposes.

Here are a few possible options for printing the ranking on the console.

Printer Consumer

Using a consumer object that takes the ranking, format the entries and print them. The following code reports a factory method the returns such a consumer:

static <T,V> Consumer<Map<Integer,List<T>>> 
rankPrinter(Function<T,V> propertyExtractor){
      return ranking ->
       ranking.entrySet().stream()
       .map( e -> e.getValue().stream().map( 
                   v -> e.getKey() + " - " + v 
                       + " (" + propertyExtractor.apply(v) + ")" ) )
        .flatMap(Function.identity())
        .forEach(System.out::println);
}

Collator Function

Using a function that transform the ranking map into a string consisting in the concatenation of the items.

static <T,V> Function<Map<Integer,List<T>>,String> 
rankCollator(Function<T,V> propertyExtractor){
  return ranking ->         
        ranking.entrySet().stream()
                .map( e -> (Stream<String>)e.getValue().stream().
                        map( v -> (String)(e.getKey() + " : " + v + " (" + propertyExtractor.apply(v) + ")") ))
                        .flatMap(Function.identity())
                        .collect(joining("\n"));
    }

The above method can be used as follows:

System.out.println(rankCollator(propertyExtractor).apply(ranking));

Ranking Map class

As an alternative, is is possible to replace the class TreeMap with a new class that extends it and overrides the toString() method.

This cane be done by writing the following collector accumulator supplier, instead of TreeMap::new:

()->new TreeMap<Integer,List<T>>(){
    public String toString(){
        return entrySet().stream()
                .map( e -> (Stream<String>)e.getValue().stream().map( 
                            v -> e.getKey().toString() + " - " + v.toString() 
                                    + " (" + propertyExtractor.apply(v) + ")" ) )
                .flatMap(Function.identity())
                .collect(joining("\n")); 
    }
}

The complete code for the generics solution is available in class StreamRanking available on my github repo.

like image 53
Marco Torchiano Avatar answered Nov 14 '22 00:11

Marco Torchiano


Unlike solved by yourself by Design First I doing it via TDD way.

Step 1, assuming there is only one word exists.

let's say ranking "foo" will represented as ["1 - foo (3)"],I solved it by using Stream.map(Function) and fake the rank with constant 1.

Arrays.stream(sentence.split(" "))
      .map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
      .collect(toList());

Step 2, assuming there are 2 words that their length are the same.

let's say ranking "foo bar" will represented as ["1 - bar (3)","1 - foo (3)"]. I solved it by Stream.sorted(Comparator);

Arrays.stream(sentence.split(" "))
      .sorted(String::compareTo)
      .map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
      .collect(toList());

Step 3, assuming there are 2 words that their length are different.

let's say ranking "fuzz bar" will represented as ["1 - fuzz (4)","2 - bar (3)"],I solved it by Comparator.comparing(Function).

Arrays.stream(sentence.split(" "))
      .sorted(Comparator.comparing(String::length).reversed()
                        .thenComparing(String::compareTo))
      .map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
      .collect(toList());

the result ["1 - fuzz (4)","1~~2~~ - bar (3)"] are the same except the faked rank constant 1.so I do a small refactoring before change code:

Arrays.stream(sentence.split(" "))
      .sorted(Comparator.comparing(String::length).reversed()
                        .thenComparing(String::compareTo))
      .map(it -> new AbstractMap.SimpleEntry<>(1,it))
      .map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
      .collect(toList());

then I can replace the first Stream.map(Function) with Stream.collect(...).stream() and replace the faked rank constant 1 with ranking.size() + 1.

Arrays.stream(sentence.split(" "))
      .sorted(Comparator.comparing(String::length).reversed()
                        .thenComparing(String::compareTo))
      .collect(ArrayList<Map.Entry<Integer, String>>::new
              , (ranking, it) -> ranking.add(new AbstractMap.SimpleEntry<>(ranking.size() + 1, it))
              , List::addAll).stream()
      .map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
      .collect(toList());

but the Step 2 broked due to rank = ranking.size() + 1; then I realized I need comparing its length with the last ranking item.

BiConsumer<List<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
    int rank = ranking.size() + 1;
    if (!ranking.isEmpty()) {
        Map.Entry<Integer, String> last = ranking.get(ranking.size() - 1);
        if (last.getValue().length() == it.length()) {
            rank = last.getKey();
        }
    }
    ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};

List<String> ranking = Arrays.stream(sentence.split(" "))
        .sorted(Comparator.comparing(String::length).reversed()
                          .thenComparing(String::compareTo))
        .collect(ArrayList::new, accumulator, List::addAll).stream()
        .map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
        .collect(toList());

Indeed, I find I can replace an ArrayList with a Stack in further:

BiConsumer<Stack<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
    int rank = ranking.size() + 1;
    if (!ranking.isEmpty()) {
        Map.Entry<Integer, String> last = ranking.peek();
        if (last.getValue().length() == it.length()) {
            rank = last.getKey();
        }
    }
    ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};

List<String> ranking = Arrays.stream(sentence.split(" "))
        .sorted(Comparator.comparing(String::length).reversed()
                          .thenComparing(String::compareTo))
        .collect(Stack::new, accumulator, List::addAll).stream()
        .map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
        .collect(toList());

Fix a bug by introduce another test

After finishing the code, I found there is a bug existed in the code. let's say ranking "fuzz buzz foo" will represented as ["1 - buzz (4)","1 - fuzz (4)","2 - foo (3)"], I solved it by calculate rank as last.rank + 1 rather than ranking.size() + 1.

BiConsumer<Stack<Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
    int rank;
    if (!ranking.isEmpty()) {
        Entry<Integer, String> last = ranking.peek();
        if (last.getValue().length() == it.length()) {
            rank = last.getKey();
        } else {
            rank = last.getKey() + 1;
        }
    } else {
        rank = 1;
    }
    ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};

List<String> ranking = Arrays.stream(sentence.split(" "))
        .sorted(comparing(String::length).reversed()
               .thenComparing(String::compareTo))
        .collect(Stack::new, accumulator, List::addAll).stream()
        .map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
        .collect(toList());

Finally

I rename variable accumulator with a meaningful name, e.g:rankingEvaluation, and replace null with Null Object, and .etc.

Map.Entry<Integer, String> NONE = new AbstractMap.SimpleEntry<>(0, "");

BiConsumer<Stack<Entry<Integer, String>>, String> rankingEvaluation = (ranking, it) -> {
    Entry<Integer, String> last = ranking.isEmpty() ? NONE : ranking.peek();
    int rank = last.getValue().length() == it.length() 
                                         ? last.getKey() 
                                         : last.getKey() + 1;
    ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};

List<String> ranking = Arrays.stream(sentence.split(" "))
        .sorted(comparing(String::length).reversed()
               .thenComparing(String::compareTo))
        .collect(Stack::new, rankingEvaluation, List::addAll).stream()
        .map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
        .collect(toList());
like image 4
holi-java Avatar answered Nov 13 '22 22:11

holi-java