Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JS: writing a function that iterates through a list of strings and returns the top 10 most frequent strings in the list

I am trying to write a function that iterates through a list of strings and returns the top 10 most frequent strings in the list. I am trying to come up with multiple solutions to this question

Here is my first solution

const list = [
    "this",
    "is",
    "a",
    "test",
    "which",
    "word",
    "wins",
    "top",
    "i",
    "don't",
    "know",
    "off",
    "hand",
    "do",
    "you",
    "this",
    "a",
    "a",
    "this",
    "test",
    "a",
    "a",
    "do",
    "hand",
    "hand",
    "a",
    "whatever",
    "what",
    "do",
    "do"
  ];

function fn1(strArr) {
    const map = new Map()
    for(const str of strArr) {
        if(map.has(str)) {
            map.set(str, map.get(str) + 1)
        } else {
            map.set(str, 1)
        }
    }
    const sortedMap =[...map.entries()].sort(([_,a], [__,b]) => a < b ? 1 : -1)
    return sortedMap.slice(0 , 10).map(([str]) => str)
}

But I cannot seem to find any other solutions to this question. Can anyone suggest an alternative suggestion?

Also, one thing to note that is the list can be really large, maybe contain 1 million strings. So we need to try to minimize the runtime complexity

like image 483
Joji Avatar asked Dec 24 '20 05:12

Joji


People also ask

How do I iterate through a string in JavaScript?

Javascript String @@iterator Method. String [@@iterator]( ) Method is used to make String iterable. [@@iterator]() returns iterator object which iterate over all code point of String. String[@@iterator] is Built – in Property of String.

Can you use forEach on a string JavaScript?

Nope, it can't. forEach() passes the index and the array as second and third argument.

Can you index strings in JavaScript?

Introduction. A string is a sequence of one or more characters that may consist of letters, numbers, or symbols. Each character in a JavaScript string can be accessed by an index number, and all strings have methods and properties available to them.


1 Answers

I believe the following solution is the fastest in practice:

  1. Map the n strings to their frequencies, as you have already done. O(n)
  2. Convert the map into an array with string/frequency pairs. O(n)
  3. Convert the array into a Max-heap based on the frequency using Floyd's method (i.e., by calling max-heapify for all indices from ⌊n/2⌋ - 1 down to 0). O(n)
  4. Extract the top element k times. (In your case, k=10.) O(k log n)

(I don't provide any code here because it basically consists of calling a binary heap library (see here for a highly optimized implementation).)

Analysis

The asymptotic complexity is O(n + k log n) which is less than O(n log k) from the solution provided by Chaos Monkey. For small k (e.g. 10), there probably won't be any significant difference. The difference becomes more apparent for larger k (as long as k ≪ n; for much larger k, see "Alternative" below). Also note that the constant factor for Steps 1 and 2 is 1, and the constant factor in Step 3 is also small on average (1.8814), so the overall constant factor is less than 4.

Alternative

There is a solution that solves the problem in O(n) on average, also with a small constant factor, which is much more efficient for larger k (i.e. when k approaches n/2). The downside is that the (very unlikely) worst-case complexity is O(n²):

  1. Map the n strings to their frequencies, as you have already done. O(n)
  2. Convert the map into an array with string/frequency pairs. O(n)
  3. Apply Quickselect (exactly like Quicksort, but just recurse on one side of the partition) to find the kth largest frequency. All elements to the left are even larger, so the result is the first k elements of the Quickselected array. O(n) average, O(n²) worst-case.

It is possible to implement a variant of Quickselect with guaranteed O(n) complexity using median of medians pivot selection, but this is not that great in practice because the constant factor is quite high. But from an academic standpoint, this would be the asymptotically optimal solution.

(Here is a JavaScript library for Quickselect, though from a quick glance, the implementation doesn't look like it's ideal for this case: a good implementation should do a Dijkstra-style 3-way partitioning.)

Benchmark

A quick and dirty benchmark with n = 10^6, k = 10, measuring only the runtimes after Step 2 (since Steps 1/2 are shared amongst all 5 methods):

Average time for Sort: 7.5 ms
Average time for ChaosMonkey: 10.25 ms
Average time for CountingSort: 5.25 ms
Average time for Mo B. Max-Heap: 4 ms
Average time for Mo B. Alternative (Quickselect): 3.25 ms

https://dotnetfiddle.net/oHRMsp

My conclusion is that for the given parameters there is not much of a difference between the different methods. For simplicity, I would just stick to the sorting method, which also scales well for both n and k.

(Lots of caveats: it's written (sloppily) in C#, not JavaScript; it hasn't been tested for correctness; individual methods are not optimized; runtimes may also depend on distribution of frequencies, the implemented Quickselect is naive in that it's not optimized for the (here) common case where lots of frequencies are equal etc...)

Final note on space

All the benchmarked methods use an additional O(n) space because they first create the frequency map (and the counting sort method uses an additional worst-case O(n) space on top of the frequency map for the count). It is possible to solve the problem with only O(1) additional space, but at the expense of a time complexity of O(kn²).

like image 153
Mo B. Avatar answered Nov 15 '22 04:11

Mo B.