Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Sort a list of strings and find the 1000 most common values in java

In java (either using external libraries or not) I need to take a list of approximately 500,000 values and find the most frequently occurring (mode) 1000. Doing my best to keep the complexity to a minimum.

What I've tried so far, make a hash, but I can't because it would have to be backwards key=count value =string, otherwise when getting the top 1000, my complexity will be garbage. and the backwards way doesn't really work great because I would be having a terrible complexity for insertion as I search for where my string is to be able to remove it and insert it one higher...

I've tried using a binary search tree, but that had the same issue of what the data would be for sorting, either on the count or the string. If it's on the string then getting the count for the top 1000 is bad, and vice versa insertion is bad.

I could sort the list first (by string) and then iterate over the list and keep a count until it changes strings. but what data structure should I use to keep track of the top 1000?

Thanks

like image 841
yankel h Avatar asked Jul 19 '17 15:07

yankel h


People also ask

How do I sort strings that contain numbers in Java?

Get the String. Create an empty integer array. The split() method of the string class accepts a string representing a delimiter, splits the current string into an array of tokens.


3 Answers

I would first create a Map<String, Long> to store the frequency of each word. Then, I'd sort this map by value in descending order and finally I'd keep the first 1000 entries.

In code:

List<String> top1000Words = listOfWords.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
    .entrySet().stream()
        .sorted(Map.Entry.comparingByValue().reversed())
        .limit(1000)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());

You might find it cleaner to separate the above into 2 steps: first collecting to the map of frequencies and then sorting its entries by value and keeping the first 1000 entries.

like image 160
fps Avatar answered Oct 13 '22 08:10

fps


I'd separate this into three phases:

  • Count word occurrences (e.g. by using a HashMap<String, Integer>)
  • Sort the results (e.g. by converting the map into a list of entries and ordering by value descending)
  • Output the top 1000 entries of the sorted results

The sorting will be slow if the counts are small (e.g. if you've actually got 500,000 separate words) but if you're expecting lots of duplicate words, it should be fine.

like image 34
Jon Skeet Avatar answered Oct 13 '22 06:10

Jon Skeet


I have had this question open for a few days now and have decided to rebel against Federico's elegant Java 8 answer and submit the least Java 8 answer possible.

The following code makes use of a helper class that associates a tally with a string.

public class TopOccurringValues {
    static HashMap<String, StringCount> stringCounts = new HashMap<>();

    // set low for demo.  Change to 1000 (or whatever)
    static final int TOP_NUMBER_TO_COLLECT = 10;

    public static void main(String[] args) {
        // load your strings in here
        List<String> strings = loadStrings();

        // tally up string occurrences
        for (String string: strings) {
            StringCount stringCount = stringCounts.get(string);
            if (stringCount == null) {
                stringCount = new StringCount(string);
            }
            stringCount.increment();
            stringCounts.put(string, stringCount);
        }

        // sort which have most
        ArrayList<StringCount> sortedCounts = new ArrayList<>(stringCounts.values());
        Collections.sort(sortedCounts);

        // collect the top occurring strings
        ArrayList<String> topCollection = new ArrayList<>();
        int upperBound = Math.min(TOP_NUMBER_TO_COLLECT, sortedCounts.size());
        System.out.println("string\tcount");
        for (int i = 0; i < upperBound; i++) {
            StringCount stringCount = sortedCounts.get(i);
            topCollection.add(stringCount.string);
            System.out.println(stringCount.string + "\t" + stringCount.count);
        }
    }

    // in this demo, strings are randomly generated numbers.
    private static List<String> loadStrings() {
        Random random = new Random(1);
        ArrayList<String> randomStrings = new ArrayList<>();
        for (int i = 0; i < 5000000; i++) {
            randomStrings.add(String.valueOf(Math.round(random.nextGaussian() * 1000)));
        }
        return randomStrings;
    }

    static class StringCount implements Comparable<StringCount> {
        int count = 0;
        String string;
        StringCount(String string) {this.string = string;}
        void increment() {count++;}
        @Override
        public int compareTo(StringCount o) {return o.count - count;}
    }
}

55 lines of code! It's like reverse code golf. The String generator creates 5 million strings instead of 500,000 because: why not?

string  count
-89 2108
70  2107
77  2085
-4  2077
36  2077
65  2072
-154    2067
-172    2064
194 2063
-143    2062

The randomly generated strings can have values between -999 and 999 but because we are getting gaussian values, we will see numbers with higher scores that are closer to 0.

like image 20
Brian Risk Avatar answered Oct 13 '22 06:10

Brian Risk