Extracting Map<K, Multiset<V>> from Stream of Streams in Java 8

Tags: java, java-8, java-stream

I have a Stream of Streams of words (this format is not set by me and cannot be changed). For example:

Stream<String> doc1 = Stream.of("how", "are", "you", "doing", "doing", "doing");
Stream<String> doc2 = Stream.of("what", "what", "you", "upto");
Stream<String> doc3 = Stream.of("how", "are", "what", "how");
Stream<Stream<String>> docs = Stream.of(doc1, doc2, doc3);

I'm trying to get this into a Map<String, Multiset<Integer>> (or its corresponding stream, as I want to process it further), where the String key is the word itself and the Multiset<Integer> holds the number of times that word appears in each document (zeros should be excluded). Multiset is a Google Guava class (not from java.util).

For example:

how   -> {1, 2}  // once in doc1, twice in doc3, none in doc2 (so doc2's count is not included)
are   -> {1, 1}  // once in doc1 and once in doc3
you   -> {1, 1}  // once in doc1 and once in doc2
doing -> {3}     // three times in doc1, none in the others
what  -> {2, 1}  // twice in doc2 and once in doc3
upto  -> {1}     // once in doc2

What is a good way to do this in Java 8?

I tried using flatMap, but the inner Stream greatly limits the options I have.


asked May 26 '17 18:05

Anoop

3 Answers

Map<String, List<Long>> map = docs
        // Count the words of each document, then stream that document's (word, count) entries.
        .flatMap(inner -> inner
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet()
                .stream())
        // Regroup by word, collecting the per-document counts into a list.
        .collect(Collectors.groupingBy(
                Entry::getKey,
                Collectors.mapping(Entry::getValue, Collectors.toList())));

System.out.println(map);

// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}
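
The question mentions wanting to process the result further. As a minimal sketch of that, the resulting map can simply be streamed again. The class name SharedWords and the follow-up step (keeping only the words found in at least two documents) are hypothetical and only serve as an illustration; the grouping itself is the same as above, using nothing beyond the JDK.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SharedWords {

    // Same two-phase grouping as above: count words per document, then regroup the counts by word.
    public static Map<String, List<Long>> countsPerDocument(Stream<Stream<String>> docs) {
        return docs
                .flatMap(doc -> doc
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .stream())
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Hypothetical follow-up step: keep only the words found in at least two documents.
    public static List<String> wordsInSeveralDocuments(Map<String, List<Long>> counts) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue().size() > 1) // one list element per containing document
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Stream<String>> docs = Stream.of(
                Stream.of("how", "are", "you", "doing", "doing", "doing"),
                Stream.of("what", "what", "you", "upto"),
                Stream.of("how", "are", "what", "how"));
        // prints [are, how, what, you]
        System.out.println(wordsInSeveralDocuments(countsPerDocument(docs)));
    }
}
```

Because a word only gets a list element for a document it actually occurs in, the size of each list is the number of documents containing that word, which is what the filter relies on.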


answered Sep 24 '22 11:09

Eugene

Since you are using Guava, you could take advantage of its utilities for working with streams, as well as its Table structure. Here's the code:

Table<String, Long, Long> result =
    Streams.mapWithIndex(docs, (doc, i) -> doc.map(word -> new SimpleEntry<>(word, i)))
        .flatMap(Function.identity())
        .collect(Tables.toTable(
            Entry::getKey, Entry::getValue, p -> 1L, Long::sum, HashBasedTable::create));

Here I'm using the Streams.mapWithIndex method to assign an index to each inner stream. Within the map function, I'm transforming each word to a pair that consists of the word and the index, so that I can later know to which document the word belongs.

Then, I'm flat-mapping the pairs (word, index) of all documents to one stream, and finally, I'm collecting all the pairs to a Guava Table by means of the Tables.toTable collector. The row is the word, the column is the document (represented by its index) and the value is the number of times the word occurs in that document (I'm assigning 1L to each occurrence of a (word, index) pair and using Long::sum to merge collisions).
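
For readers who want to try the indexing idea without Guava, here is a rough JDK-only sketch of the same steps. It is not the code above: Streams.mapWithIndex is replaced by collecting the outer stream into a list and walking it with IntStream.range, and the Table is replaced by a nested Map (word -> document index -> count). The class name IndexedWordCounts and the helper tagWithIndex are made up for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class IndexedWordCounts {

    // Hypothetical helper: pairs every word of one document with that document's index.
    private static Stream<Map.Entry<String, Integer>> tagWithIndex(Stream<String> doc, int index) {
        return doc.map(word -> new SimpleEntry<>(word, index));
    }

    // word -> (document index -> occurrences); documents without the word get no entry.
    public static Map<String, Map<Integer, Long>> countsByWordAndDocument(Stream<Stream<String>> docs) {
        // Materialize only the outer stream so each inner stream can be addressed by position.
        List<Stream<String>> documents = docs.collect(Collectors.toList());
        return IntStream.range(0, documents.size())
                .boxed()
                .flatMap(i -> tagWithIndex(documents.get(i), i))
                // Outer key plays the role of the table row, inner key the column, counting() the cell.
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.groupingBy(Map.Entry::getValue, Collectors.counting())));
    }

    public static void main(String[] args) {
        Stream<Stream<String>> docs = Stream.of(
                Stream.of("how", "are", "you", "doing", "doing", "doing"),
                Stream.of("what", "what", "you", "upto"),
                Stream.of("how", "are", "what", "how"));
        Map<String, Map<Integer, Long>> counts = countsByWordAndDocument(docs);
        // prints {0=1, 2=2}: once in the first document, twice in the third (indices start at 0)
        System.out.println(counts.get("how"));
    }
}
```

Unlike the plain list of counts, this shape keeps track of which document each count came from, at the cost of holding the outer stream's elements in memory.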

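You have all the info you need in the result table, but if you still need a map from each word to a multiset of its per-document counts (here a Map<String, Multiset<Long>> rather than a Map<String, Multiset<Integer>>, because the counts are Long), you could do it this way: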

Map<String, Multiset<Long>> map = Maps.transformValues(
    result.rowMap(),
    m -> HashMultiset.create(m.values()));

Note: you need Guava 21 or later for this to work.

answered Sep 21 '22 11:09

fps

Map<String, Multiset<Integer>> result = docs
        .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
        .flatMap(m -> m.entrySet().stream())
        .collect(Collectors.groupingBy(Multiset.Entry::getElement,
                Collectors.mapping(Multiset.Entry::getCount,
                        Collectors.toCollection(HashMultiset::create))));

// {upto=[1], how=[1, 2], doing=[3], what=[1, 2], are=[1 x 2], you=[1 x 2]}

Multiset is useful for getting the word count, but not really necessary for storing the counts. If you're fine with Map<String, List<Integer>>, just replace the last line with Collectors.toList())));.
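
To illustrate why a plain list can be enough for storing the counts: the kind of question Multiset.count answers ("in how many documents does the word occur exactly n times?") can also be asked of a List<Integer> with Collections.frequency. This is only a small sketch; the class and method names are made up, and the two lists are the counts of "are" and "how" from the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CountListQueries {

    // Number of documents in which the word occurs exactly `occurrences` times.
    public static int documentsWithCount(List<Integer> countsPerDocument, int occurrences) {
        return Collections.frequency(countsPerDocument, occurrences);
    }

    public static void main(String[] args) {
        List<Integer> are = Arrays.asList(1, 1); // "are": once in doc1, once in doc3
        List<Integer> how = Arrays.asList(1, 2); // "how": once in doc1, twice in doc3
        System.out.println(documentsWithCount(are, 1)); // prints 2
        System.out.println(documentsWithCount(how, 2)); // prints 1
        System.out.println(documentsWithCount(how, 3)); // prints 0
    }
}
```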

Or, since you're using Guava anyway, why not a ListMultimap?

ListMultimap<String, Integer> result = docs
        .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
        .flatMap(m -> m.entrySet().stream())
        .collect(ArrayListMultimap::create,
                (r, e) -> r.put(e.getElement(), e.getCount()),
                Multimap::putAll);

// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}

answered Sep 21 '22 11:09

Sean Van Gorder
