Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stream filter/reduction on duplicated entries

I am attempting to filter/reduce a stream of data that has some duplicated entries in it.

In essence, I attempting to find a better solution to filtering a set of data than what I implemented. We have data that, at its base, are something like this:

Action | Date         | Detail
15     | 2016-03-15   | 
5      | 2016-03-15   | D1
5      | 2016-09-25   | D2      <--
5      | 2016-09-25   | D3      <-- same day, different detail
4      | 2017-02-08   | D4
4      | 2017-02-08   | D5
5      | 2017-03-01   | D6      <--
5      | 2017-03-05   | D6      <-- different day, same detail; need earliest
5      | 2017-03-08   | D7
5      | 2017-03-10   | D8
...

I need to extract the details such that:

  • Only action 5 is selected
  • If a detail is the same (e.g, D6 appears twice on different days), the earliest date is selected

These data are loaded into Objects (one instance for each "record"), and there are other fields on the Object but they are not relevant for this filtering. The Detail is stored as a String, the Date as a ZonedDateTime, and the Action is an int (well, actually an enum, but here shown as an int). The Objects are given in a List<Entry> in chronological order.

I was able to get a working, but what I consider to be suboptimal, solution by doing:

  List<Entry> entries = getEntries(); // retrieved from a server

  final Set<String> update = new HashSet<>();
  List<Entry> updates =
  entries.stream()
    .filter(e -> e.getType() == 5)
    .filter(e -> pass(e, update))
    .collect(Collectors.toList());


private boolean pass(Entry ehe, Set<String> update)
   {
     final String val =  ehe.getDetail();
     if (update.contains(val)) { return false; }
     update.add(val);
     return true;
   }

But the issue is I had to use this pass() method and in it checking a Set<String> to maintain whether a given Detail had alreay been processed. While this approach works, it seems like it should be possible to avoid an external reference.

I tried to use a groupingBy on the Detail, and it would allow extracting the earliest entry from the list, the problem was I no longer had a date ordering and I had to process the resultant Map<String,List<Entry>>.

It seems like some reduce operation (if I used that term correctly) here without the use of the pass() method should be possible, but I am struggling to get a better implementation.

What would be a better approach such that the .filter(e -> pass(e, update)) could be removed?

Thank you!

like image 621
KevinO Avatar asked May 04 '17 11:05

KevinO


People also ask

How do I filter duplicates in stream?

You can use the Stream. distinct() method to remove duplicates from a Stream in Java 8 and beyond. The distinct() method behaves like a distinct clause of SQL, which eliminates duplicate rows from the result set.

What is the purpose of filter () method in streams?

The filter() function of the Java stream allows you to narrow down the stream's items based on a criterion. If you only want items that are even on your list, you can use the filter method to do this. This method accepts a predicate as an input and returns a list of elements that are the results of that predicate.

Does stream filter modify the original array?

stream(). filter(i -> i >= 3); does not change original list. All stream operations are non-interfering (none of them modify the data source), as long as the parameters that you give to them are non-interfering too.


2 Answers

Two solutions in this answer of which the second is significantly faster.

Solution 1

An adaptation of the answer by Ole V.V. on another question:

Collection<Entry> result = 
 entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.groupingBy(Entry::getDetail, Collectors.collectingAndThen(Collectors.minBy(Comparator.comparing(Entry::getDate)), Optional::get)))
  .values();

With your example dataset you end up with (I picked GMT+0 as timezone):

Entry [action=5, date=2017-03-01T00:00Z[GMT], detail=D6]
Entry [action=5, date=2017-03-08T00:00Z[GMT], detail=D7]
Entry [action=5, date=2017-03-10T00:00Z[GMT], detail=D8]
Entry [action=5, date=2016-03-15T00:00Z[GMT], detail=D1]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D2]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D3]

If you insist on getting a List back:

List<Entry> result = new ArrayList<>(entries.stream() ..... .values());

If you want to get your original order back use the 3-parameter groupingBy:

...groupingBy(Entry::getDetail, LinkedHashMap::new, Collectors.collectingAndThen(...))

Solution 2

Using toMap, which is easier to read and is faster (see comment by holi-java on this answer, and next 'section'):

List<Entry> col = new ArrayList<>(
  entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a))
  .values());

where (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a can be substituted with:

BinaryOperator.minBy(Comparator.comparing(Entry::getDate))

If you want to get your original order back in this solution use the 4-parameter toMap:

...toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a, LinkedHashMap::new)

Performance

With the testdata I created for testing my solutions, I checked the runtime of both solutions. The first solution takes on average 67 ms (ran it only 20 times, so don't trust the numbers!), the second solution took 2 ms on average. If anyone would like to do a proper performance comparison, put the results in the comments and I'll add it here.

like image 130
Robin Topper Avatar answered Oct 08 '22 10:10

Robin Topper


If I understood correctly...

 List<Entry> result = list.stream().collect(Collectors.toMap(
            Entry::getDetail,
            Function.identity(),
            (left, right) -> {
                return left.getDate().compareTo(right.getDate()) > 0 ? right : left;
            }, LinkedHashMap::new))
            .values()
            .stream()
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.toList());
like image 41
Eugene Avatar answered Oct 08 '22 12:10

Eugene