I am trying to write a function that iterates through a list of strings and returns the 10 most frequent strings in the list. I would like to come up with multiple solutions to this problem.
Here is my first solution:
const list = [
  "this", "is", "a", "test", "which", "word", "wins", "top", "i", "don't",
  "know", "off", "hand", "do", "you", "this", "a", "a", "this", "test",
  "a", "a", "do", "hand", "hand", "a", "whatever", "what", "do", "do"
];
function fn1(strArr) {
  const map = new Map();
  for (const str of strArr) {
    if (map.has(str)) {
      map.set(str, map.get(str) + 1);
    } else {
      map.set(str, 1);
    }
  }
  // Sort entries by frequency, descending; b - a is a proper numeric comparator.
  const sortedEntries = [...map.entries()].sort(([, a], [, b]) => b - a);
  return sortedEntries.slice(0, 10).map(([str]) => str);
}
But I cannot seem to come up with any other solutions. Can anyone suggest an alternative approach?
Also, note that the list can be really large, perhaps containing a million strings, so we need to minimize the runtime complexity.
I believe the following solution is the fastest in practice:
1. Create a map of the n strings to their frequencies, as you have already done. O(n)
2. Put the map entries into an array. O(n)
3. Build a max-heap on that array with respect to the frequencies (sift down every internal node, from index ⌊n/2⌋ - 1 down to 0). O(n)
4. Extract the maximum element from the heap k times. (In your case, k = 10.) O(k log n)
(I don't provide any code here because it basically consists of calling a binary heap library (see here for a highly optimized implementation).)
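For readers who want to see it spelled out anyway, here is a minimal JavaScript sketch of the heap-based approach; the helper names are my own, not from any particular library:

```javascript
// Sketch of the heap approach: build a max-heap over [string, frequency]
// entries in O(n), then extract the maximum k times in O(k log n).
function topKByHeap(freqMap, k) {
  const heap = [...freqMap.entries()]; // array of [string, frequency] pairs

  // Sift the entry at index i down until the max-heap property holds.
  function siftDown(i, size) {
    while (true) {
      const left = 2 * i + 1, right = 2 * i + 2;
      let largest = i;
      if (left < size && heap[left][1] > heap[largest][1]) largest = left;
      if (right < size && heap[right][1] > heap[largest][1]) largest = right;
      if (largest === i) return;
      [heap[i], heap[largest]] = [heap[largest], heap[i]];
      i = largest;
    }
  }

  // Heapify: sift down every internal node, from ⌊n/2⌋ - 1 down to 0. O(n)
  for (let i = Math.floor(heap.length / 2) - 1; i >= 0; i--) {
    siftDown(i, heap.length);
  }

  // Extract the maximum k times. O(k log n)
  const result = [];
  let size = heap.length;
  while (result.length < k && size > 0) {
    result.push(heap[0][0]);       // the root holds the most frequent string
    heap[0] = heap[--size];        // move the last leaf to the root
    siftDown(0, size);             // restore the heap property
  }
  return result;
}
```

For the original question, this would be called as `topKByHeap(map, 10)` after building the frequency map.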
The asymptotic complexity is O(n + k log n), which is less than the O(n log k) of the solution provided by Chaos Monkey. For small k (e.g. 10), there probably won't be any significant difference. The difference becomes more apparent for larger k (as long as k ≪ n; for much larger k, see "Alternative" below). Also note that the constant factor for Steps 1 and 2 is 1, and the constant factor in Step 3 is also small on average (1.8814), so the overall constant factor is less than 4.
Alternative: There is a solution that solves the problem in O(n) on average, also with a small constant factor, which is much more efficient for larger k (i.e. when k approaches n/2). The downside is that the (very unlikely) worst-case complexity is O(n²):
1. Create a map of the n strings to their frequencies, as you have already done. O(n)
2. Put the map entries into an array. O(n)
3. Use Quickselect to find the k-th largest frequency. All elements to the left are even larger, so the result is the first k elements of the Quickselected array. O(n) on average, O(n²) worst-case.

It is possible to implement a variant of Quickselect with guaranteed O(n) complexity using median-of-medians pivot selection, but this is not that great in practice because the constant factor is quite high. From an academic standpoint, though, this would be the asymptotically optimal solution.
(Here is a JavaScript library for Quickselect, though from a quick glance, the implementation doesn't look like it's ideal for this case: a good implementation should do a Dijkstra-style 3-way partitioning.)
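For illustration, here is a minimal JavaScript sketch of the Quickselect approach. This is my own naive implementation using a random-pivot Lomuto partition, not the linked library; it does not do 3-way partitioning, and the top k come back in no particular order among themselves:

```javascript
// Sketch of the Quickselect approach: partially reorder the entries in place
// so that the k entries with the largest frequencies occupy the front.
function topKByQuickselect(freqMap, k) {
  const entries = [...freqMap.entries()]; // array of [string, frequency] pairs
  let lo = 0, hi = entries.length - 1;
  while (lo < hi) {
    // Lomuto partition around a random pivot, by descending frequency.
    const p = lo + Math.floor(Math.random() * (hi - lo + 1));
    [entries[p], entries[hi]] = [entries[hi], entries[p]];
    const pivot = entries[hi][1];
    let store = lo;
    for (let i = lo; i < hi; i++) {
      if (entries[i][1] > pivot) {
        [entries[i], entries[store]] = [entries[store], entries[i]];
        store++;
      }
    }
    [entries[store], entries[hi]] = [entries[hi], entries[store]];
    if (store === k - 1) break;        // pivot landed exactly at the k-th spot
    if (store < k - 1) lo = store + 1; // k-th largest is to the right
    else hi = store - 1;               // k-th largest is to the left
  }
  return entries.slice(0, k).map(([str]) => str);
}
```

If the top 10 need to be in order of frequency, the k returned entries can be sorted afterwards, which costs only O(k log k).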
A quick and dirty benchmark with n = 10^6 and k = 10, measuring only the runtimes after Step 2 (since Steps 1 and 2 are shared amongst all 5 methods):
Average time for Sort: 7.5 ms
Average time for ChaosMonkey: 10.25 ms
Average time for CountingSort: 5.25 ms
Average time for Mo B. Max-Heap: 4 ms
Average time for Mo B. Alternative (Quickselect): 3.25 ms
https://dotnetfiddle.net/oHRMsp
My conclusion is that for the given parameters there is not much of a difference between the different methods. For simplicity, I would just stick to the sorting method, which also scales well for both n and k.
(Lots of caveats: it's written (sloppily) in C#, not JavaScript; it hasn't been tested for correctness; individual methods are not optimized; runtimes may also depend on the distribution of frequencies; the implemented Quickselect is naive in that it's not optimized for the (here) common case where many frequencies are equal; etc.)
All the benchmarked methods use an additional O(n) space because they first create the frequency map (and the counting-sort method uses an additional worst-case O(n) space on top of the frequency map for the counts). It is possible to solve the problem with only O(1) additional space, but at the expense of a time complexity of O(kn²).
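For completeness, here is a hypothetical sketch (my own code, not from the benchmark) of what such an O(1)-extra-space, O(kn²) approach could look like: it never builds a frequency map, and instead recounts frequencies by scanning the raw array:

```javascript
// O(1) extra space beyond the k-element output, O(k * n^2) time:
// for each of the k output slots, scan the array and count each candidate's
// frequency on the fly, keeping the most frequent string not yet selected.
function topKConstantSpace(strArr, k) {
  const result = []; // the output itself; no auxiliary frequency map
  for (let pick = 0; pick < k; pick++) {
    let best = null, bestCount = 0;
    for (const candidate of strArr) {
      if (result.includes(candidate)) continue; // already chosen
      let count = 0;
      for (const s of strArr) if (s === candidate) count++; // O(n) recount
      if (count > bestCount) { best = candidate; bestCount = count; }
    }
    if (best === null) break; // fewer than k distinct strings
    result.push(best);
  }
  return result;
}
```

At n around a million this is of course far too slow to be practical; it only illustrates the space/time trade-off.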