How to Sort a list of strings and find the 1000 most common values in java

Tags:

In java (either using external libraries or not) I need to take a list of approximately 500,000 values and find the most frequently occurring (mode) 1000. Doing my best to keep the complexity to a minimum.

What I've tried so far, make a hash, but I can't because it would have to be backwards key=count value =string, otherwise when getting the top 1000, my complexity will be garbage. and the backwards way doesn't really work great because I would be having a terrible complexity for insertion as I search for where my string is to be able to remove it and insert it one higher...

I've tried using a binary search tree, but that had the same issue of what the data would be for sorting, either on the count or the string. If it's on the string then getting the count for the top 1000 is bad, and vice versa insertion is bad.

I could sort the list first (by string) and then iterate over the list and keep a count until it changes strings. but what data structure should I use to keep track of the top 1000?

Thanks

841

asked Jul 19 '17 15:07

yankel h

3 Answers

I would first create a Map<String, Long> to store the frequency of each word. Then, I'd sort this map by value in descending order and finally I'd keep the first 1000 entries.

In code:

List<String> top1000Words = listOfWords.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
    .entrySet().stream()
        .sorted(Map.Entry.comparingByValue().reversed())
        .limit(1000)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());

You might find it cleaner to separate the above into 2 steps: first collecting to the map of frequencies and then sorting its entries by value and keeping the first 1000 entries.

160

answered Oct 13 '22 08:10

fps

I'd separate this into three phases:

Count word occurrences (e.g. by using a HashMap<String, Integer>)
Sort the results (e.g. by converting the map into a list of entries and ordering by value descending)
Output the top 1000 entries of the sorted results

The sorting will be slow if the counts are small (e.g. if you've actually got 500,000 separate words) but if you're expecting lots of duplicate words, it should be fine.

answered Oct 13 '22 06:10

Jon Skeet

I have had this question open for a few days now and have decided to rebel against Federico's elegant Java 8 answer and submit the least Java 8 answer possible.

The following code makes use of a helper class that associates a tally with a string.

public class TopOccurringValues {
    static HashMap<String, StringCount> stringCounts = new HashMap<>();

    // set low for demo.  Change to 1000 (or whatever)
    static final int TOP_NUMBER_TO_COLLECT = 10;

    public static void main(String[] args) {
        // load your strings in here
        List<String> strings = loadStrings();

        // tally up string occurrences
        for (String string: strings) {
            StringCount stringCount = stringCounts.get(string);
            if (stringCount == null) {
                stringCount = new StringCount(string);
            }
            stringCount.increment();
            stringCounts.put(string, stringCount);
        }

        // sort which have most
        ArrayList<StringCount> sortedCounts = new ArrayList<>(stringCounts.values());
        Collections.sort(sortedCounts);

        // collect the top occurring strings
        ArrayList<String> topCollection = new ArrayList<>();
        int upperBound = Math.min(TOP_NUMBER_TO_COLLECT, sortedCounts.size());
        System.out.println("string\tcount");
        for (int i = 0; i < upperBound; i++) {
            StringCount stringCount = sortedCounts.get(i);
            topCollection.add(stringCount.string);
            System.out.println(stringCount.string + "\t" + stringCount.count);
        }
    }

    // in this demo, strings are randomly generated numbers.
    private static List<String> loadStrings() {
        Random random = new Random(1);
        ArrayList<String> randomStrings = new ArrayList<>();
        for (int i = 0; i < 5000000; i++) {
            randomStrings.add(String.valueOf(Math.round(random.nextGaussian() * 1000)));
        }
        return randomStrings;
    }

    static class StringCount implements Comparable<StringCount> {
        int count = 0;
        String string;
        StringCount(String string) {this.string = string;}
        void increment() {count++;}
        @Override
        public int compareTo(StringCount o) {return o.count - count;}
    }
}

55 lines of code! It's like reverse code golf. The String generator creates 5 million strings instead of 500,000 because: why not?

string  count
-89 2108
70  2107
77  2085
-4  2077
36  2077
65  2072
-154    2067
-172    2064
194 2063
-143    2062

The randomly generated strings can have values between -999 and 999 but because we are getting gaussian values, we will see numbers with higher scores that are closer to 0.

answered Oct 13 '22 06:10

Brian Risk

Related questions
                            
                                Convert String to URL in android/java [duplicate]
                            
                                Maven issue to build one module using revision property
                            
                                cannot access org.springframework.web.WebApplicationInitializer
                            
                                Left Join in Spring Data JPA's Specification
                            
                                TextView.setText(int) causes app to crash
                            
                                How to refactor a large class with many methods that have single responsibility?
                            
                                Partial Matching of Regular Expressions
                            
                                Failed to convert from type [java.lang.Object[]] to type
                            
                                logback deletes the logs before the MaxHistory during hourly rollback
                            
                                How to add bean programmatically to Spring context?
                            
                                How to use Redirect in ModelAndView
                            
                                Corresponding Spring JPA method name for requesting a random row in SQL
                            
                                Why does Array elements have to be of the same type
                            
                                Spring Test returning 404 instead of 200 in testing Rest API in Spring Boot
                            
                                Eclipse not launching for MacOS Sierra
                            
                                Print HashMap Values with Stream api
                            
                                Why can't I access static methods of an interface using an instance variable
                            
                                Java 8 streams do something with an element which has a boolean field
                            
                                Setting a boolean flag inside Java 8 Stream
                            
                                How to find the child elements of a specific WebElement (which I don't know the xpath to)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to Sort a list of strings and find the 1000 most common values in java

Tags:

java

sorting

data-structures

yankel h

People also ask

3 Answers

fps

Jon Skeet

Brian Risk

Recent Activity

Donate For Us