Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java: how to find top 10 most common String + frequency in ArrayList?

I am trying to find the top 10 most common Strings in an ArrayList + their count (frequency of occurrence).

How may I do this with the best time complexity?

The below code finds the most common word + frequency in the form (String=int)

E.g. the=2

  public static Entry<String, Integer> get10MostCommon(WordStream words) {

    ArrayList<String> list = new ArrayList<String>();
    Map<String, Integer> stringsCount = new HashMap<>();
    Map.Entry<String, Integer> mostRepeated = null;

    for (String i : words) {
      list.add(i);
    }

    for (String s : list) {
      Integer c = stringsCount.get(s);
      if (c == null)
        c = new Integer(0);
      c++;
      stringsCount.put(s, c);
    }

    for (Map.Entry<String, Integer> e : stringsCount.entrySet()) {
      if (mostRepeated == null || mostRepeated.getValue() < e.getValue())
        mostRepeated = e;
    }
    return mostRepeated;
  }
like image 557
Iona Avatar asked Mar 14 '16 16:03

Iona


4 Answers

You could do it in two steps, using Java 8 streams:

Map<String, Long> map = list.stream()
        .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

List<Map.Entry<String, Long>> result = map.entrySet().stream()
        .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
        .limit(10)
        .collect(Collectors.toList());

The first stream maps words to their frequency by using Collectors.groupingBy() along with Collectors.counting().

This returns a map, whose entries are streamed and sorted by map entry value, in reverse order. Then, the stream is limited to keep only 10 elements, which are finally collected to a list.

like image 145
fps Avatar answered Sep 28 '22 04:09

fps


You will have to iterate the words from start to end at least once, so you won't end better than O(n), where n is the words size. Then there's extraction of m top entries (10 in your case). Suppose you have k unique words in your n words in total, to find m top entries, you need to run max search m times on k entries, which results in m * k operations, giving you O(m * n) in worst case (when all words are unique). In total, this gives you O(n * (m + 1)) operations, or O(11 * n) in your case (10 times max search plus the initial grouping run).

Here's my try (JDK8+, not tested):

public static Collection<Map.Entry<String, Integer>> topOccurences(List<String> words, int topThreshold) {
    Map<String, Integer> occurences = new HashMap<>();
    words.stream().forEach((word) -> {
        int count = 1;
        if (occurences.containsKey(word)) {
            count = occurences.get(word) + 1;
        }
        occurences.put(word, count);
    });

    List<Map.Entry<String, Integer>> entries = new LinkedList<>(occurences.entrySet());
    List<Map.Entry<String, Integer>> tops = new LinkedList<>();
    Comparator<Map.Entry<String, Integer>> valueComp = Comparator.comparing((Map.Entry<String, Integer> t) -> t.getValue());
    int topcount = 0;
    while (topcount < topThreshold && !entries.isEmpty()) {
        Map.Entry<String, Integer> max = Collections.max(entries, valueComp);
        tops.add(max);
        entries.remove(max);
        topcount++;
    }
    return tops;
}
like image 36
hoefling Avatar answered Sep 28 '22 03:09

hoefling


I would decompose this into two methods.

The first would do nothing but create the Map of word frequencies.

The second would return the n highest frequency words.

What should your code do if you ask for n most frequent words but the Map has fewer than that number as keys?

It's your chance to try JDK 8 lambdas and filter the frequency Map efficiently.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Calculate word frequencies from a List of words
 * User: mduffy
 * Date: 3/14/2016
 * Time: 1:07 PM
 * @link http://stackoverflow.com/questions/35992891/java-how-to-find-top-10-most-common-string-frequency-in-arraylist/35993252#35993252
 */
public class WordFrequencyDemo {

    public static void main(String[] args) {
        List<String> words = Arrays.asList(args);
        Map<String, Integer> wordFrequencies = WordFrequencyDemo.getWordFrequencies(words);
        System.out.println(wordFrequencies);
    }

    public static Map<String, Integer> getWordFrequencies(List<String> words) {
        Map<String, Integer> wordFrequencies = new LinkedHashMap<String, Integer>();
        if (words != null) {
            for (String word : words) {
                if (word != null) {
                    word = word.trim();
                    if (!wordFrequencies.containsKey(word)) {
                        wordFrequencies.put(word, 0);
                    }
                    int count = wordFrequencies.get(word);
                    wordFrequencies.put(word, ++count);
                }
            }
        }
        return wordFrequencies;
    }
}
like image 38
duffymo Avatar answered Sep 28 '22 05:09

duffymo


You would always use a hash to count the words first, which will of course use O(n) time and O(n) space. This is the first step.

Then it comes to how to select the top 10. You could use a sorting which takes at least O(nlogn) time. But there is a better way, which is to use a heap. Let's say k = 10 in your case. You need to add pair object of word and its frequency into a min heap of size k where we use the frequency as the key for the min heap. If the heap is full then remove the minimum element (top) from the heap and add the new word-frequency pair only if the frequency of this word has frequency greater than the top word in the heap. Once we scanned all the words in the map and the heap is properly updated then the elements contained in the min heap are the top k most frequents. Below is the example code. Just modify the code a little bit to take a ArrayList rather than an array will do your work.

class Pair {
    String key;
    int value;

    Pair(String key, int value) {
        this.key = key;
        this.value = value;
    }
}

public class Solution {
    /**
     * @param words an array of string
     * @param k an integer
     * @return an array of string
     */

    private Comparator<Pair> pairComparator = new Comparator<Pair>() {
        public int compare(Pair left, Pair right) {
            if (left.value != right.value) {
                return left.value - right.value;
            }
            return right.key.compareTo(left.key);
        }
    };

    public String[] topKFrequentWords(String[] words, int k) {
        if (k == 0) {
            return new String[0];
        }

        HashMap<String, Integer> counter = new HashMap<>();
        for (String word : words) {
            if (counter.containsKey(word)) {
                counter.put(word, counter.get(word) + 1);
            } else {
                counter.put(word, 1);
            }
        }

        PriorityQueue<Pair> Q = new PriorityQueue<Pair>(k, pairComparator);
        for (String word : counter.keySet()) {
            Pair peak = Q.peek();
            Pair newPair = new Pair(word, counter.get(word));
            if (Q.size() < k) {
                Q.add(newPair);
            } else if (pairComparator.compare(newPair, peak) > 0) {
                Q.poll();
                Q.add(new Pair(word, counter.get(word)));
            }
        }

        String[] result = new String[k];
        int index = 0;
        while (!Q.isEmpty()) {
            result[index++] = Q.poll().key;
        }

        // reverse
        for (int i = 0; i < index / 2; i++) {
            String temp = result[i];
            result[i] = result[index - i - 1];
            result[index - i - 1] = temp;
        }

        return result;
    }
}
like image 41
lkq Avatar answered Sep 28 '22 03:09

lkq