JS: writing a function that iterates through a list of strings and returns the top 10 most frequent strings in the list

Tags:

algorithm

I am trying to write a function that iterates through a list of strings and returns the top 10 most frequent strings in the list. I am trying to come up with multiple solutions to this question

Here is my first solution

const list = [
    "this",
    "is",
    "a",
    "test",
    "which",
    "word",
    "wins",
    "top",
    "i",
    "don't",
    "know",
    "off",
    "hand",
    "do",
    "you",
    "this",
    "a",
    "a",
    "this",
    "test",
    "a",
    "a",
    "do",
    "hand",
    "hand",
    "a",
    "whatever",
    "what",
    "do",
    "do"
  ];

function fn1(strArr) {
    const map = new Map()
    for(const str of strArr) {
        if(map.has(str)) {
            map.set(str, map.get(str) + 1)
        } else {
            map.set(str, 1)
        }
    }
    const sortedMap =[...map.entries()].sort(([_,a], [__,b]) => a < b ? 1 : -1)
    return sortedMap.slice(0 , 10).map(([str]) => str)
}

But I cannot seem to find any other solutions to this question. Can anyone suggest an alternative suggestion?

Also, one thing to note that is the list can be really large, maybe contain 1 million strings. So we need to try to minimize the runtime complexity

483

asked Dec 24 '20 05:12

1 Answers

I believe the following solution is the fastest in practice:

Map the n strings to their frequencies, as you have already done. O(n)
Convert the map into an array with string/frequency pairs. O(n)
Convert the array into a Max-heap based on the frequency using Floyd's method (i.e., by calling max-heapify for all indices from ⌊n/2⌋ - 1 down to 0). O(n)
Extract the top element k times. (In your case, k=10.) O(k log n)

(I don't provide any code here because it basically consists of calling a binary heap library (see here for a highly optimized implementation).)

Analysis

The asymptotic complexity is O(n + k log n) which is less than O(n log k) from the solution provided by Chaos Monkey. For small k (e.g. 10), there probably won't be any significant difference. The difference becomes more apparent for larger k (as long as k ≪ n; for much larger k, see "Alternative" below). Also note that the constant factor for Steps 1 and 2 is 1, and the constant factor in Step 3 is also small on average (1.8814), so the overall constant factor is less than 4.

Alternative

There is a solution that solves the problem in O(n) on average, also with a small constant factor, which is much more efficient for larger k (i.e. when k approaches n/2). The downside is that the (very unlikely) worst-case complexity is O(n²):

Map the n strings to their frequencies, as you have already done. O(n)
Convert the map into an array with string/frequency pairs. O(n)
Apply Quickselect (exactly like Quicksort, but just recurse on one side of the partition) to find the kth largest frequency. All elements to the left are even larger, so the result is the first k elements of the Quickselected array. O(n) average, O(n²) worst-case.

It is possible to implement a variant of Quickselect with guaranteed O(n) complexity using median of medians pivot selection, but this is not that great in practice because the constant factor is quite high. But from an academic standpoint, this would be the asymptotically optimal solution.

(Here is a JavaScript library for Quickselect, though from a quick glance, the implementation doesn't look like it's ideal for this case: a good implementation should do a Dijkstra-style 3-way partitioning.)

Benchmark

A quick and dirty benchmark with n = 10^6, k = 10, measuring only the runtimes after Step 2 (since Steps 1/2 are shared amongst all 5 methods):

Average time for Sort: 7.5 ms
Average time for ChaosMonkey: 10.25 ms
Average time for CountingSort: 5.25 ms
Average time for Mo B. Max-Heap: 4 ms
Average time for Mo B. Alternative (Quickselect): 3.25 ms

https://dotnetfiddle.net/oHRMsp

My conclusion is that for the given parameters there is not much of a difference between the different methods. For simplicity, I would just stick to the sorting method, which also scales well for both n and k.

(Lots of caveats: it's written (sloppily) in C#, not JavaScript; it hasn't been tested for correctness; individual methods are not optimized; runtimes may also depend on distribution of frequencies, the implemented Quickselect is naive in that it's not optimized for the (here) common case where lots of frequencies are equal etc...)

Final note on space

All the benchmarked methods use an additional O(n) space because they first create the frequency map (and the counting sort method uses an additional worst-case O(n) space on top of the frequency map for the count). It is possible to solve the problem with only O(1) additional space, but at the expense of a time complexity of O(kn²).

153

answered Nov 15 '22 04:11

Mo B.

Related questions
                            
                                How to return a value from forEach [duplicate]
                            
                                How to pass html template as props to Vue component
                            
                                Redirect to checkout in ReactJS
                            
                                react is not updating functional component state on input change
                            
                                useState Array of Objects
                            
                                scrollIntoView() not working for horizontal scroll (Selenium)
                            
                                Is it possible to print 16 times in React
                            
                                jest - mockReturnValue : Promise<boolean>
                            
                                The confusion with JIT compilation in V8
                            
                                what is the difference between const [a, b, c] = array; and const {a, b, c} = array;
                            
                                How to use React.forwardRef() in React Native
                            
                                What is Prettier keyboard shortcut command in VS Code to format only a block of React code, not just format on file autosave?
                            
                                How to parse dates in JSON request with NestJs @Body
                            
                                Owl carousel multiple rows
                            
                                Property 'children' is missing in type
                            
                                Material UI + React Form Hook + multiple checkboxes + default selected
                            
                                JavaScript difference between ' ' and (' ')
                            
                                How to destructure this Array
                            
                                cypress wait for redirection after login
                            
                                Typescript - make an optional property required when another property is present

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With