I have a Stream of Streams of words (this format is not set by me and cannot be changed). For example:
Stream<String> doc1 = Stream.of("how", "are", "you", "doing", "doing", "doing");
Stream<String> doc2 = Stream.of("what", "what", "you", "upto");
Stream<String> doc3 = Stream.of("how", "are", "what", "how");
Stream<Stream<String>> docs = Stream.of(doc1, doc2, doc3);
I'm trying to get this into a Map<String, Multiset<Integer>> (or a corresponding stream, as I want to process it further), where the key is the word itself and the Multiset<Integer> holds the number of times that word appears in each document (zeros should be excluded). Multiset is a Google Guava class (not from java.util).
For example:
how -> {1, 2} // once in doc1, twice in doc3, none in doc2 (so doc2's count is not included)
are -> {1, 1} // once in doc1 and once in doc3
you -> {1, 1} // once in doc1 and once in doc2
doing -> {3} // three times in doc1, none in the others
what -> {2, 1} // twice in doc2 and once in doc3
upto -> {1} // once in doc2
What is a good way to do this in Java 8?
I tried using flatMap, but the inner Stream greatly limits the options I have.
Map<String, List<Long>> map = docs.flatMap(
inner -> inner.collect(
Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet()
.stream())
.collect(Collectors.groupingBy(
Entry::getKey,
Collectors.mapping(Entry::getValue, Collectors.toList())));
System.out.println(map);
// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}
Since you are using Guava, you could take advantage of its utilities for working with streams, and likewise of its Table structure. Here's the code:
Table<String, Long, Long> result =
Streams.mapWithIndex(docs, (doc, i) -> doc.map(word -> new SimpleEntry<>(word, i)))
.flatMap(Function.identity())
.collect(Tables.toTable(
Entry::getKey, Entry::getValue, p -> 1L, Long::sum, HashBasedTable::create));
Here I'm using the Streams.mapWithIndex method to assign an index to each inner stream. Within the map function, I'm transforming each word into a pair that consists of the word and the index, so that I can later tell which document the word belongs to.
Then, I'm flat-mapping the (word, index) pairs of all documents into one stream, and finally, I'm collecting all the pairs into a Guava Table by means of the Tables.toTable collector. The row is the word, the column is the document (represented by its index) and the value is the count of that word in that document (I'm assigning 1L to each (word, index) pair and using Long::sum to merge collisions).
You have all the info you need in the result table, but if you still need a Map<String, Multiset<Integer>>, you could do it this way (the counts in the table are Long, so this yields a Map<String, Multiset<Long>>):
Map<String, Multiset<Long>> map = Maps.transformValues(
result.rowMap(),
m -> HashMultiset.create(m.values()));
Note: you need Guava 21 or later for this to work.
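For readers without Guava on the classpath, the row/column/value idea described above can be sketched with plain JDK collections. This is only an illustration, not the answer's code: the class name IndexedWordCounts and the method countByDocument are made up for the example, a nested TreeMap stands in for the Guava Table, and an iterator with a counter stands in for Streams.mapWithIndex.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper: a nested map (word -> document index -> count)
// plays the role of the Guava Table from the answer above.
class IndexedWordCounts {

    static Map<String, Map<Integer, Long>> countByDocument(Stream<Stream<String>> docs) {
        Map<String, Map<Integer, Long>> table = new TreeMap<>();
        Iterator<Stream<String>> it = docs.iterator();
        for (int i = 0; it.hasNext(); i++) {
            final int docIndex = i; // index of the current document
            // Each (word, docIndex) occurrence contributes 1L; Long::sum merges collisions.
            it.next().forEach(word ->
                table.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(docIndex, 1L, Long::sum));
        }
        return table;
    }

    public static void main(String[] args) {
        Stream<Stream<String>> docs = Stream.of(
            Stream.of("how", "are", "you", "doing", "doing", "doing"),
            Stream.of("what", "what", "you", "upto"),
            Stream.of("how", "are", "what", "how"));
        System.out.println(countByDocument(docs));
        // {are={0=1, 2=1}, doing={0=3}, how={0=1, 2=2}, upto={1=1}, what={1=2, 2=1}, you={0=1, 1=1}}
    }
}
```

The per-word counts asked for in the question are then the values() of each inner map. Documents in which a word never appears simply have no entry, so zeros are excluded automatically.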
Map<String, Multiset<Integer>> result = docs
.map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
.flatMap(m -> m.entrySet().stream())
.collect(Collectors.groupingBy(Multiset.Entry::getElement,
Collectors.mapping(Multiset.Entry::getCount,
Collectors.toCollection(HashMultiset::create))));
// {upto=[1], how=[1, 2], doing=[3], what=[1, 2], are=[1 x 2], you=[1 x 2]}
Multiset is useful for getting the word count, but not really necessary for storing the counts. If you're fine with Map<String, List<Integer>>, just replace the last line with Collectors.toList())));
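If Guava is not available at all, the Map<String, List<Integer>> variant can also be approximated with JDK collectors only. The following is a minimal sketch under that assumption: the class name WordCountLists and the method countsPerWord are invented for the example, Collectors.summingInt(w -> 1) replaces the Multiset for per-document counting so the counts come out as Integer, and a TreeMap is used only to make the printed order predictable.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: JDK-only take on the Map<String, List<Integer>> variant.
class WordCountLists {

    static Map<String, List<Integer>> countsPerWord(Stream<Stream<String>> docs) {
        return docs
            // Per document: word -> number of occurrences in that document.
            .map(doc -> doc.collect(
                Collectors.groupingBy(Function.identity(), Collectors.summingInt(w -> 1))))
            // Flatten all per-document (word, count) entries into one stream.
            .flatMap(counts -> counts.entrySet().stream())
            // Group the counts by word, keeping document order within each list.
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        Stream<Stream<String>> docs = Stream.of(
            Stream.of("how", "are", "you", "doing", "doing", "doing"),
            Stream.of("what", "what", "you", "upto"),
            Stream.of("how", "are", "what", "how"));
        System.out.println(countsPerWord(docs));
        // {are=[1, 1], doing=[3], how=[1, 2], upto=[1], what=[2, 1], you=[1, 1]}
    }
}
```

Each inner stream is consumed exactly once, inside the map step, which is what makes this workable even though the Stream<Stream<String>> input cannot be re-read.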
Or, since you're using Guava anyway, why not a ListMultimap?
ListMultimap<String, Integer> result = docs
.map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
.flatMap(m -> m.entrySet().stream())
.collect(ArrayListMultimap::create,
(r, e) -> r.put(e.getElement(), e.getCount()),
Multimap::putAll);
// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}