In java (either using external libraries or not) I need to take a list of approximately 500,000 values and find the most frequently occurring (mode) 1000. Doing my best to keep the complexity to a minimum.
What I've tried so far, make a hash, but I can't because it would have to be backwards key=count value =string, otherwise when getting the top 1000, my complexity will be garbage. and the backwards way doesn't really work great because I would be having a terrible complexity for insertion as I search for where my string is to be able to remove it and insert it one higher...
I've tried using a binary search tree, but that had the same issue of what the data would be for sorting, either on the count or the string. If it's on the string then getting the count for the top 1000 is bad, and vice versa insertion is bad.
I could sort the list first (by string) and then iterate over the list and keep a count until it changes strings. but what data structure should I use to keep track of the top 1000?
Thanks
Get the String. Create an empty integer array. The split() method of the string class accepts a string representing a delimiter, splits the current string into an array of tokens.
I would first create a Map<String, Long>
to store the frequency of each word. Then, I'd sort this map by value in descending order and finally I'd keep the first 1000
entries.
In code:
List<String> top1000Words = listOfWords.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet().stream()
.sorted(Map.Entry.comparingByValue().reversed())
.limit(1000)
.map(Map.Entry::getKey)
.collect(Collectors.toList());
You might find it cleaner to separate the above into 2 steps: first collecting to the map of frequencies and then sorting its entries by value and keeping the first 1000 entries.
I'd separate this into three phases:
HashMap<String, Integer>
)The sorting will be slow if the counts are small (e.g. if you've actually got 500,000 separate words) but if you're expecting lots of duplicate words, it should be fine.
I have had this question open for a few days now and have decided to rebel against Federico's elegant Java 8 answer and submit the least Java 8 answer possible.
The following code makes use of a helper class that associates a tally with a string.
public class TopOccurringValues {
static HashMap<String, StringCount> stringCounts = new HashMap<>();
// set low for demo. Change to 1000 (or whatever)
static final int TOP_NUMBER_TO_COLLECT = 10;
public static void main(String[] args) {
// load your strings in here
List<String> strings = loadStrings();
// tally up string occurrences
for (String string: strings) {
StringCount stringCount = stringCounts.get(string);
if (stringCount == null) {
stringCount = new StringCount(string);
}
stringCount.increment();
stringCounts.put(string, stringCount);
}
// sort which have most
ArrayList<StringCount> sortedCounts = new ArrayList<>(stringCounts.values());
Collections.sort(sortedCounts);
// collect the top occurring strings
ArrayList<String> topCollection = new ArrayList<>();
int upperBound = Math.min(TOP_NUMBER_TO_COLLECT, sortedCounts.size());
System.out.println("string\tcount");
for (int i = 0; i < upperBound; i++) {
StringCount stringCount = sortedCounts.get(i);
topCollection.add(stringCount.string);
System.out.println(stringCount.string + "\t" + stringCount.count);
}
}
// in this demo, strings are randomly generated numbers.
private static List<String> loadStrings() {
Random random = new Random(1);
ArrayList<String> randomStrings = new ArrayList<>();
for (int i = 0; i < 5000000; i++) {
randomStrings.add(String.valueOf(Math.round(random.nextGaussian() * 1000)));
}
return randomStrings;
}
static class StringCount implements Comparable<StringCount> {
int count = 0;
String string;
StringCount(String string) {this.string = string;}
void increment() {count++;}
@Override
public int compareTo(StringCount o) {return o.count - count;}
}
}
55 lines of code! It's like reverse code golf. The String generator creates 5 million strings instead of 500,000 because: why not?
string count
-89 2108
70 2107
77 2085
-4 2077
36 2077
65 2072
-154 2067
-172 2064
194 2063
-143 2062
The randomly generated strings can have values between -999 and 999 but because we are getting gaussian values, we will see numbers with higher scores that are closer to 0.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With