A common problem when processing data is to build a ranking of items based on some property.
The ranking operation consists in mapping the items to a set of ordinal numbers (called ranking). In presence of ties (i.e. two items having the same ranking) several strategies can be used, in this context we will assume the standard competition ranking (a.k.a. "1224" ranking) is used.
The question I'd like to investigate here is what is the "best" way of doing this using the streaming API. Where best may mean, most elengant, most efficient, simplest, etc.
Let's start with a very simple example: a stream of words (i.e. String
s)
out of which a ranking by decreasing lengths should be created.
If we take the first paragraph of this document, i.e.
A common problem when processing data is to build
a ranking of itemsbased on some property
we would get the following ranking (by inverse lengths) should be obtained:
1 - processing (10)
2 - property (8)
3 - problem (7)
3 - ranking (7)
5 - common (6)
6 - build (5)
6 - items (5)
6 - based (5)
9 - when (4)
9 - data (4)
9 - some (4)
12 - is (2)
12 - to (2)
12 - of (2)
12 - on (2)
16 - A (1)
16 - a (1)
When two items share the same ranking property value, they are assigned the same ranking.
The main question here is:
what stream API construct do you used to compute the ranking and how you organize your code?
A second related question is
how you represent the result of the ranking computation?
The two questions are addressed one after the other
The typical ranking, as shown above, is a list of items preceded by their relative ranking value.
There are, at least, two possible implementations for the return value of a ranking.
List<Rank>
A possible internal representation of the ranking could be a list of elements containing both the ranking value and the item. For instance we could define a class:
public class Rank<T> {
protected int ranking;
protected T item;
protected Rank(int r, T i) { ranking=t; item=i; }
public int getRanking() { return ranking; }
public T getItem() { return item; }
static <T> Rank<T> factory(int ranking, T item) {
return new Rank<T>(ranking,item);
}
}
Or, with a somewhat more esotheric code:
public interface Rank<T> {
int getRanking();
T getItem();
static <T> Rank<T> factory(int ranking, T item) {
return new Rank<T>(){
public int getRanking() { return ranking; }
public T getItem() { return item; }
};}
}
The resulting ranking will be returned as a List<Rank>
.
The list must be returned sorted by descending ranking and processed in that order.
A potential disadvantage of such solution is the need for a new ad-hoc class or interface (i.e. Rank
).
SortedMap<Integer,List<T>>
An alternative solution that uses only predefined standard classes and intefaces is based on a map.
The keys of the map correspond to the ranking values, while the values of the map consist in lists that contain the items sharing that ranking.
The return type of the ranking would be a SortedMap<T,List<T>>
.
The sorted map will be implicitly sorted and can be traversed according to the natural order of the key.
This latter solution seems preferable because it does not require any new class and can be better understood.
The ranking computation procedure can be implemented in different ways. Here two approaches are examined.
In all cases we need to have a function that serves the purpose of extracting the ranking property:
Function<String,Integer> propertyExtractor = String::length;
Typically, the property type should be comparable, though to be more general (e.g. to sort strings in decreasing order) the property comparator can be defined:
Comparator<Integer> propertyComparator = reverseOrder();
The two approaches are illustrated, referring to the case study described above, using a words
stream as a starting point:
String paragraph = "A common problem when processing data is to build a ranking of items based on some property";
Stream<String> words = Stream.of(paragraph.split(" "));
An initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:
SortedMap<Integer,List<String>> ranking = new TreeMap<>();
The procedure is as follows:
Sort the elements of the stream
words.sorted(comparing(propetyExtractor,propertyComparator))
Assign the ranking, starting with 1
and incrementing it
any time the property changes. The ranking has to be incremented
by the number of items that shared the current ranking.
.forEach( item -> {
Integer property = propertyExtractor.apply(item);
if(ranking.isEmpty()){
ranking.put(1,new LinkedList<String>());
}else{
Integer rank = ranking.lastKey();
List<String> items = ranking.get(rank);
if(! property.equals(propertyExtractor.apply(items.get(0)))) {
ranking.put(rank+items.size(), new LinkedList<String>());
}
}
ranking.get(ranking.lastKey()).add(item);
});
The above code works as follows:
The initial sequence of items is unsorted:
{"DD","BBB","F","CCC","AAAA","EE"}
Step 1 sorts the item using the propertyExtractor
and the relative propertyComparator
, i.e. by string length:
{"AAAA","BBB","CCC","DD","EE","F"}
then for each element (in order) a new entry in the map is added any time the lenght of the new element differs from the previous element lenght
When "AAAA"
is processed (first item) a new entry is added to
the map with key (i.e. ranking) = 1:
ranking
: {1=["AAAA"]}
As the second item "BBB"
is processed, since its lenght (3)
is different from the last element length (4, the lenght of "AAAA"
)
a new entry is added with un updated ranking value:
ranking
: {1=["AAAA"],2=["BBB"]}
As the third item "CCC"
is processed, since its lenght (3)
is equal to the last element length, the item is added
to the entry list
ranking
: {1=["AAAA"],2=["BBB","CCC"]}
etc.
A more functional-style version can be defined by using a collector that encapsulates the data structure where the results are accumulated. In practice we can write a single stream expression that returns the ranking map:
SortedMap<Integer,List<String>> ranking =
words.sorted(comparing(propertyExtractor,propertyComparator))
.collect(TreeMap::new, // supplier
(rank, item) -> { // accumulator
Integer property = propertyExtractor.apply(item);
if(rank.isEmpty()){
rank.put(1,new LinkedList<String>());
}else{
Integer r = rank.lastKey();
List<String> items = rank.get(r);
Integer prevProp = propertyExtractor.apply(items.get(0))
if(! property.equals(prevProp)) {
rank.put(r+items.size(), new LinkedList<String>());
}
}
rank.get(rank.lastKey()).add(item);
},
(rank1,rank2) -> { // combiner
\\...
}
);
The combiner method that is left undefiened in the above code deserves some additional reflections.
The functional interface is involved when the collection is performed in parallel; it is used to combine two partial results of the accumulation into a single one. In this case, it has to implement the interface BiConsumer<R,R>
and it is supposed to combine the two accumulated rankings - rank1
and rank2
- into the first one.
A possible implementation of the supplier is:
BiConsumer<SortedMap<Integer,List<String>>,
SortedMap<Integer,List<String>>> combiner =
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
== propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0)) ){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
The collector has no declared properties, so the elements will be processed in order and then composed in order. E.g. the sorted list of items is divided into two parts, then the first part is processed through with one copy of the collector producing as result the map rank1
; in parallel, the second part is processed giving as result rank2
.
Let us suppose the streams are processed in parallel, in two concurent execution of the collect operations:
the initial sorted stream (result of the sorted()
operation is
divided into two stream, maintaining the orded
{"AAAA","BBB","CCC"}
{"DD","EE","F"}
two distinct collectors operate in parallel, on each sub-stream the result are two maps contain the partial ranking of each sub-stream:
rank1
= {1=["AAAA"],2=["BBB","CCC"]}
rank2
= {1=["DD","EE"],3=["F"]}
The combiner operation should merge rank2
into rank1
, in practice
each entry of the rank2
operation should be added to rank1
with its
key updated. They keys are updated by adding an offset that is equal to
the key of the last entry plus the lenght of the last entry value list
minus one:
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
In practice the entry 1=["DD","EE"]
of rank2
should be turned into
4=["DD","EE"]
and added to rank1
.
In addition, a special case should be considered, i.e. when the items
in the last entry of rank1
and those in the first entry of rank2
share the same ranking property value. E.g. for string length:
rank1
= {1=["AAAA"],2=["BBB","CCC"]}
rank2
= {1=["DDD"],2=["EE"],3=["F"]}
When this case occurs, the items in the rank2
first entry list
must be added to this in the last rank1
entry list, and then the
first entry removed.
That is the above maps should be transformed into:
rank1
= {1=["AAAA"],2=["BBB","CCC"
,"DDD"
]}
rank2
= {
~~1=["DDD"],
~~2=["EE"],3=["F"]}
Then the entries of rank2
can be updated and added to rank1
as
described above.
As for the sorted list version, an initial naive version of the procedure can be defined using a supporting map data structure upon which the stream operations work by means of side effects:
SortedMap<Integer,List<String>> ranking = new TreeMap<>();
The naive procedure is as follows:
group the items by their property in a sorted map
words.collect(groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),toList()))
for each entry in the above map compute the ranking
.forEach(
(property,items) ->
ranking.put(ranking.isEmpty()?1:ranking.lastKey()+
ranking.get(ranking.lastKey()).size(),
items )
);
The above code works as follows:
The initial sequence of items is unsorted:
{"DD","BBB","F","CCC","AAAA","EE"}
Step 1 groups the item by their propertyExtractor
and store them into
a sorted set whose keys are sorted through the propertyComparator
, i.e.
by string length:
{4=["AAAA"],3=["BBB","CCC"],2=["DD","EE"],1=["F"]}
Step 2, creates for each entry in the intermediate map a new entry in the result map have the ranking value as key and the same value (i.e. the list of items) as the intermediate entry.
The ranking is computed as follows:
ranking.lastKey()
) plus number of elements
sharing the previous ranking (ranking.get(ranking.lastKey()).size()
).The result is the final ranking map:
{1=["AAAA"],2=["BBB","CCC"],4=["DD","EE"],6=["F"]}
The above procedure can be rewritten using a collector to avoid side effects of the operations.
Since the first step consists in a collection, we can use the predefined collector factory methdo collectingAndThen
to concatenate the first collector with a function that is applied to the collector result;
such a function will perform the step 2 described above.
SortedMap<Integer,List<String>> ranking =
words.collect(
collectingAndThen(
groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),
toList()),
map -> map.entrySet().stream().collect(
TreeMap::new,
(rank,entry) ->
rank.put(rank.isEmpty()?1:rank.lastKey()+
rank.get(rank.lastKey()).size(),
entry.getValue() ),
combiner
)
)
);
Since the structure of the result, i.e. the accumulator object, is the same as the sorted stream solution, here the same combiner can be used.
The above discussion and solution applied to the special case of a stream of Strings. But the approach can be generalized using generics.
The solution based on the sorted stream can be enclosed in a function:
static <T,V> SortedMap<Integer,List<T>>
rank(Stream<T> stream,
Function<T,V> propertyExtractor,
Comparator<V> propertyComparator){
return
stream.sorted(comparing(propertyExtractor,propertyComparator))
.collect(TreeMap::new,
(rank, item) -> {
V property = propertyExtractor.apply(item);
if(rank.isEmpty()){
rank.put(new Integer(1),new LinkedList<T>());
}else{
Integer r = rank.lastKey();
List<T> items = rank.get(r);
if(! property.equals(propertyExtractor.apply(items.get(0)))) {
rank.put(r+items.size(), new LinkedList<T>());
}
}
rank.get(rank.lastKey()).add(item);
},
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
== propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
);
}
The above method can be applied as:
SortedMap<Integer,List<String>> ranking =
rank(words,String::length,reverseOrder());
The approach based on the grouping by property value can be encapsulated in a collector:
static <T,V> Collector<T,?,SortedMap<Integer,List<T>>>
rankingCollector(Function<T,V> propertyExtractor,
Comparator<V> propertyComparator){
return
collectingAndThen(
groupingBy(propertyExtractor,
()->new TreeMap<>(propertyComparator),
toList()),
map -> map.entrySet().stream().collect(
TreeMap::new,
(rank,entry) ->
rank.put(rank.isEmpty()?1:rank.lastKey()+
rank.get(rank.lastKey()).size(),
entry.getValue() ),
(rank1,rank2) -> {
int lastRanking = rank1.lastKey();
int offset = lastRanking + rank1.get(lastRanking).size()-1;
if( propertyExtractor.apply(rank1.get(lastRanking).get(0))
==
propertyExtractor.apply(rank2.get(rank2.firstKey()).get(0))){
rank1.get(lastRanking).addAll(rank2.get(rank2.firstKey()));
rank2.remove(rank2.firstKey());
}
rank2.forEach((r,items) -> {rank1.put(offset+r, items);} );
}
)
);
}
This collector factory method can be used e.g. as:
SortedMap<Integer,List<String>> ranking =
words.collect(rankingCollector(String::length,reverseOrder()));
Once the ranking has been computed and stored in the map, it can be printed, typically for debug purposes.
Here are a few possible options for printing the ranking on the console.
Consumer
Using a consumer object that takes the ranking, format the entries and print them. The following code reports a factory method the returns such a consumer:
static <T,V> Consumer<Map<Integer,List<T>>>
rankPrinter(Function<T,V> propertyExtractor){
return ranking ->
ranking.entrySet().stream()
.map( e -> e.getValue().stream().map(
v -> e.getKey() + " - " + v
+ " (" + propertyExtractor.apply(v) + ")" ) )
.flatMap(Function.identity())
.forEach(System.out::println);
}
Function
Using a function that transform the ranking map into a string consisting in the concatenation of the items.
static <T,V> Function<Map<Integer,List<T>>,String>
rankCollator(Function<T,V> propertyExtractor){
return ranking ->
ranking.entrySet().stream()
.map( e -> (Stream<String>)e.getValue().stream().
map( v -> (String)(e.getKey() + " : " + v + " (" + propertyExtractor.apply(v) + ")") ))
.flatMap(Function.identity())
.collect(joining("\n"));
}
The above method can be used as follows:
System.out.println(rankCollator(propertyExtractor).apply(ranking));
Map
classAs an alternative, is is possible to replace the class TreeMap
with a new class that extends it and overrides the toString()
method.
This cane be done by writing the following collector accumulator
supplier, instead of TreeMap::new
:
()->new TreeMap<Integer,List<T>>(){
public String toString(){
return entrySet().stream()
.map( e -> (Stream<String>)e.getValue().stream().map(
v -> e.getKey().toString() + " - " + v.toString()
+ " (" + propertyExtractor.apply(v) + ")" ) )
.flatMap(Function.identity())
.collect(joining("\n"));
}
}
The complete code for the generics solution is available in class StreamRanking available on my github repo.
Unlike solved by yourself by Design First I doing it via TDD way.
let's say ranking "foo"
will represented as ["1 - foo (3)"]
,I solved it by using Stream.map(Function) and fake the rank with constant 1
.
Arrays.stream(sentence.split(" "))
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
let's say ranking "foo bar"
will represented as ["1 - bar (3)","1 - foo (3)"]
. I solved it by Stream.sorted(Comparator);
Arrays.stream(sentence.split(" "))
.sorted(String::compareTo)
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
let's say ranking "fuzz bar"
will represented as ["1 - fuzz (4)","2 - bar (3)"]
,I solved it by Comparator.comparing(Function).
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.map(it -> String.format("%d - %s (%d)", 1, it, it.length()))
.collect(toList());
the result ["1 - fuzz (4)","1~~2~~ - bar (3)"] are the same except the faked rank constant 1
.so I do a small refactoring before change code:
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.map(it -> new AbstractMap.SimpleEntry<>(1,it))
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
then I can replace the first Stream.map(Function) with Stream.collect(...).stream() and replace the faked rank constant 1
with ranking.size() + 1
.
Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(ArrayList<Map.Entry<Integer, String>>::new
, (ranking, it) -> ranking.add(new AbstractMap.SimpleEntry<>(ranking.size() + 1, it))
, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
but the Step 2 broked due to rank = ranking.size() + 1
; then I realized I need comparing its length with the last ranking item.
BiConsumer<List<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank = ranking.size() + 1;
if (!ranking.isEmpty()) {
Map.Entry<Integer, String> last = ranking.get(ranking.size() - 1);
if (last.getValue().length() == it.length()) {
rank = last.getKey();
}
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(ArrayList::new, accumulator, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
Indeed, I find I can replace an ArrayList with a Stack in further:
BiConsumer<Stack<Map.Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank = ranking.size() + 1;
if (!ranking.isEmpty()) {
Map.Entry<Integer, String> last = ranking.peek();
if (last.getValue().length() == it.length()) {
rank = last.getKey();
}
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(Comparator.comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, accumulator, List::addAll).stream()
.map(it -> String.format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
After finishing the code, I found there is a bug existed in the code. let's say ranking "fuzz buzz foo"
will represented as ["1 - buzz (4)","1 - fuzz (4)","2 - foo (3)"]
, I solved it by calculate rank as last.rank + 1
rather than ranking.size() + 1
.
BiConsumer<Stack<Entry<Integer, String>>, String> accumulator = (ranking, it) -> {
int rank;
if (!ranking.isEmpty()) {
Entry<Integer, String> last = ranking.peek();
if (last.getValue().length() == it.length()) {
rank = last.getKey();
} else {
rank = last.getKey() + 1;
}
} else {
rank = 1;
}
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, accumulator, List::addAll).stream()
.map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
I rename variable accumulator
with a meaningful name, e.g:rankingEvaluation
, and replace null
with Null Object, and .etc.
Map.Entry<Integer, String> NONE = new AbstractMap.SimpleEntry<>(0, "");
BiConsumer<Stack<Entry<Integer, String>>, String> rankingEvaluation = (ranking, it) -> {
Entry<Integer, String> last = ranking.isEmpty() ? NONE : ranking.peek();
int rank = last.getValue().length() == it.length()
? last.getKey()
: last.getKey() + 1;
ranking.add(new AbstractMap.SimpleEntry<>(rank, it));
};
List<String> ranking = Arrays.stream(sentence.split(" "))
.sorted(comparing(String::length).reversed()
.thenComparing(String::compareTo))
.collect(Stack::new, rankingEvaluation, List::addAll).stream()
.map(it -> format("%d - %s (%d)", it.getKey(), it.getValue(), it.getValue().length()))
.collect(toList());
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With