I have a class IndexEntry which looks like this:
public class IndexEntry implements Comparable<IndexEntry>
{
private String word;
private int frequency;
private int documentId;
...
//Simple getters for all properties
public int getFrequency()
{
return frequency;
}
...
}
I am storing objects of this class in a Guava SortedSetMultimap (which allows for multiple values per key) where I am mapping a String word to some IndexEntry objects. Behind the scenes, it maps each word to a SortedSet<IndexEntry>.
I am trying to implement a sort of indexed structure of words to documents and their occurrence frequencies inside the documents.
I know how to get the count of the most common word, but I can't seem to get the word itself.
Here is what I have to get the count of the most common term, where entries is the SortedSetMultimap, along with helper methods:
public int mostFrequentWordFrequency()
{
return entries
.keySet()
.stream()
.map(this::totalFrequencyOfWord)
.max(Comparator.naturalOrder()).orElse(0);
}
public int totalFrequencyOfWord(String word)
{
return getEntriesOfWord(word)
.stream()
.mapToInt(IndexEntry::getFrequency)
.sum();
}
public SortedSet<IndexEntry> getEntriesOfWord(String word)
{
return entries.get(word);
}
I am trying to learn Java 8 features because they seem really useful. However, I can't seem to get the stream working the way I want. I want to have both the word and its frequency at the end of the stream, but barring that, if I have the word, I can very easily get the total occurrences of that word.
Currently, I keep ending up with a Stream<SortedSet<IndexEntry>>, which I can't do anything with. I don't know how to get the most frequent word without the frequencies, but if I have the frequency, I can't seem to keep track of the corresponding word. I tried creating a WordFrequencyPair POJO class to store both, but then I just had a Stream<SortedSet<WordFrequencyPair>>, and I couldn't figure out how to map that into something useful.
What am I missing?
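One way to keep the word attached is to stream the keys and compare them by their total, so the stream element never stops being the word. The following is a minimal sketch using only JDK collections: a plain Map<String, List<Integer>> stands in for the SortedSetMultimap (each list holds one word's per-document frequencies), and the class name WordIndexSketch and its add method are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical stand-in: a Map<String, List<Integer>> plays the role of the
// SortedSetMultimap; each list holds one word's per-document frequencies.
class WordIndexSketch {
    private final Map<String, List<Integer>> entries = new TreeMap<>();

    void add(String word, List<Integer> frequencies) {
        entries.put(word, frequencies);
    }

    int totalFrequencyOfWord(String word) {
        return entries.getOrDefault(word, Collections.emptyList())
                .stream()
                .mapToInt(Integer::intValue)
                .sum();
    }

    // Compare the keys by their total; the stream element stays the word.
    Optional<String> mostFrequentWord() {
        return entries.keySet()
                .stream()
                .max(Comparator.comparingInt(this::totalFrequencyOfWord));
    }

    // Pair each word with its total so both reach the end of the stream.
    Optional<Map.Entry<String, Integer>> mostFrequentWordWithTotal() {
        return entries.keySet()
                .stream()
                .<Map.Entry<String, Integer>>map(
                        w -> new SimpleImmutableEntry<>(w, totalFrequencyOfWord(w)))
                .max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        WordIndexSketch index = new WordIndexSketch();
        index.add("apple", Arrays.asList(3, 4));
        index.add("pear", Arrays.asList(10));
        System.out.println(index.mostFrequentWord());
        System.out.println(index.mostFrequentWordWithTotal());
    }
}
```

The first method recomputes the total on every comparison, which is fine for a small index. The second computes each total once and carries it along in a Map.Entry, which is probably closer to what the WordFrequencyPair attempt was aiming for.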
I think it would be a better design to use the documentId as the key of the TreeMultimap rather than the word:
import com.google.common.collect.*;
public class Main {
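// static so that main and the static helper methods can use it; keys use natural
// ordering because Ordering.arbitrary() compares by identity, not by Integer value.
static TreeMultimap<Integer, IndexEntry> entries = TreeMultimap.<Integer, IndexEntry>create(Ordering.natural(), Ordering.natural().reverse());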
public static void main(String[] args) {
// Add elements to `entries`
// Get the most frequent word in document #1
String mostFrequentWord = entries.get(1).first().getWord();
}
}
class IndexEntry implements Comparable<IndexEntry> {
private String word;
private int frequency;
private int documentId;
public String getWord() {
return word;
}
public int getFrequency() {
return frequency;
}
public int getDocumentId() {
return documentId;
}
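@Override
public int compareTo(IndexEntry i) {
// Break ties on word: a TreeMultimap drops values that compare as equal.
int byFrequency = Integer.compare(frequency, i.frequency);
return byFrequency != 0 ? byFrequency : word.compareTo(i.word);
}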
}
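A compareTo that looks only at frequency has a side effect worth knowing about: sorted sets decide whether two elements are duplicates with the comparator, not with equals, so two different words with the same frequency in one document would collapse into a single value. The sketch below shows the effect with a plain java.util.TreeSet, which behaves the same way as the value sets of a TreeMultimap in this respect. The FreqEntry class is a hypothetical, trimmed-down stand-in for IndexEntry.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical, trimmed-down stand-in for IndexEntry.
class FreqEntry {
    final String word;
    final int frequency;

    FreqEntry(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }
}

class TieBreakSketch {
    // Orders by frequency only.
    static final Comparator<FreqEntry> BY_FREQUENCY =
            Comparator.comparingInt((FreqEntry e) -> e.frequency);

    // Orders by frequency, then by word, so distinct words never compare as equal.
    static final Comparator<FreqEntry> BY_FREQUENCY_THEN_WORD =
            BY_FREQUENCY.thenComparing((FreqEntry e) -> e.word);

    // Adds three entries, two of which share a frequency, and reports how many survive.
    static int distinctCount(Comparator<FreqEntry> order) {
        TreeSet<FreqEntry> set = new TreeSet<>(order);
        set.add(new FreqEntry("cat", 2));
        set.add(new FreqEntry("dog", 2)); // same frequency as "cat"
        set.add(new FreqEntry("emu", 5));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCount(BY_FREQUENCY));           // "dog" is dropped
        System.out.println(distinctCount(BY_FREQUENCY_THEN_WORD)); // all three are kept
    }
}
```

Adding a tie-breaker on the word (and on documentId, if the same set can hold entries from several documents) keeps every entry while still sorting by frequency first.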
You can then implement the methods that you had before with the following:
public static int totalFrequencyOfWord(String word) {
return entries.values()
.stream()
.filter(i -> word.equals(i.getWord()))
.mapToInt(IndexEntry::getFrequency)
.sum();
}
/**
* This method iterates through the values of the {@link TreeMultimap},
* searching for {@link IndexEntry} objects which have their {@code word}
* field equal to the parameter, word.
*
* @param word
* The word to search for in every document.
* @return
* A {@code List<Pair<Integer, Integer>>} where each {@link Pair}
* will hold the document's ID as its first element and the frequency
* of the word in the document as its second element.
*
* Note that the {@link Pair} object is defined in javafx.util.Pair
*/
public static List<Pair<Integer, Integer>> totalWordUses(String word) {
return entries.values()
.stream()
.filter(i -> word.equals(i.getWord()))
.map(i -> new Pair<>(i.getDocumentId(), i.getFrequency()))
.collect(Collectors.toList());
}
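A solution using only the JDK:
// Assumes static imports of Collectors.groupingBy and Collectors.summingInt.
// Stream the values (the IndexEntry objects), not the keys, which are plain Strings.
entries.values().stream()
.collect(groupingBy(IndexEntry::getWord, summingInt(IndexEntry::getFrequency)))
.values().stream().max(Comparator.naturalOrder()).orElse(0);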
Or with StreamEx:
StreamEx.of(entries.values())
.groupingBy(IndexEntry::getWord, summingInt(IndexEntry::getFrequency))
.values().stream().max(Comparator.naturalOrder()).orElse(0);
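Both variants above return only the highest total. To get the word as well, one option is to take the maximum over the grouped map's entries instead of its values. This is a minimal sketch with JDK classes only; WordCount is a hypothetical stand-in for IndexEntry, and a plain list stands in for entries.values().

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-in for IndexEntry with just the two fields used here.
class WordCount {
    private final String word;
    private final int frequency;

    WordCount(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    String getWord() { return word; }
    int getFrequency() { return frequency; }
}

class GroupingSketch {
    // Sums frequencies per word, then picks the entry with the largest sum.
    static Optional<Map.Entry<String, Integer>> mostFrequent(Collection<WordCount> values) {
        Map<String, Integer> totals = values.stream()
                .collect(Collectors.groupingBy(
                        WordCount::getWord,
                        Collectors.summingInt(WordCount::getFrequency)));
        return totals.entrySet()
                .stream()
                .max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        List<WordCount> values = Arrays.asList(
                new WordCount("apple", 3),
                new WordCount("pear", 5),
                new WordCount("apple", 4));
        mostFrequent(values).ifPresent(
                e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

The returned entry carries both the word (getKey) and its total (getValue), and the Optional is empty when there are no entries at all.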