
Find the most frequent number in an array, with limited memory

How can I find the most frequent number in an array? The array can be extremely large, for example 2 GB, and we have only limited memory, say 100 MB.

I'm thinking about an external sort: sort the data, then count the duplicate numbers that end up next to each other. Or a hashmap. But I don't know what to do about the limited memory, and I'm not even sure an external sort is a good idea for this.

JudyJiang asked Jan 17 '14 17:01




1 Answer

In the worst case, all your numbers are distinct except for one number that appears twice. There is no way to detect this unless the two duplicates are loaded into main memory at the same time, which is unlikely without sorting if your total data size is much larger than main memory. In that case, asymptotically the best thing to do is an external merge sort: sort the numbers in batches that fit in memory and save each sorted batch to its own file on disk, then do the merge step of merge sort, reading a few lines at a time from every sorted file and writing the merged sorted list to a new file. Finally, go through the merged file in order and count how many times you see each number, keeping track of which number has occurred the most times so far.
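The external sort approach above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the function name `most_frequent_external`, the helper names, the `chunk_size` parameter, and the one-integer-per-line run file format are all assumptions made for the example. It also counts during the merge instead of first writing the merged list to a new file, which gives the same answer while skipping one pass over the disk.

```python
import heapq
import itertools
import os
import tempfile


def _write_sorted_run(chunk):
    """Sort one in-memory chunk and write it to a temp file, one integer per line."""
    chunk.sort()
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{x}\n" for x in chunk)
    return path


def _read_run(path):
    """Lazily yield the integers stored in one sorted run file."""
    with open(path) as f:
        for line in f:
            yield int(line)


def most_frequent_external(numbers, chunk_size=1_000_000):
    """Return (value, count) for the most frequent integer in `numbers`.

    `numbers` is any iterable of ints (hypothetical input format). At most
    `chunk_size` values are held in memory while the sorted runs are built.
    """
    it = iter(numbers)
    run_paths = []
    try:
        # Phase 1: cut the input into sorted runs stored on disk.
        while True:
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_sorted_run(chunk))

        # Phase 2: k-way merge of the runs; equal values arrive back to back,
        # so a single running counter is enough.
        best_value, best_count = None, 0
        current, count = None, 0
        for x in heapq.merge(*(_read_run(p) for p in run_paths)):
            if x == current:
                count += 1
            else:
                current, count = x, 1
            if count > best_count:
                best_value, best_count = current, count
        return best_value, best_count
    finally:
        # Clean up the temporary run files.
        for p in run_paths:
            os.remove(p)
```

With 100 MB of memory, `chunk_size` would be chosen so that one chunk of Python integers fits comfortably. Note that this sketch keeps one file handle open per run during the merge, so a very large number of runs would call for merging in several rounds.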

If you can assume that the most frequent number makes up more than 50% of the array, then you can do much better: constant extra memory and a single pass through the list (this is the Boyer-Moore majority vote algorithm). Initialize the most common value (MCV) to the first number and a counter N to 1, then go through the rest of the list. If N is 0, set MCV to the next number and set N to 1. Otherwise, if the next number equals MCV, increase N by 1; if it differs, decrease N by 1. When the pass ends, MCV holds the majority value, provided one exists. If no number accounts for more than half of the array (even exactly 50% is not enough), the result can be arbitrary, so a second pass that counts the occurrences of MCV is needed to confirm it.
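The single-pass counter scheme described above can be sketched as below. The function names are made up for this example. The sketch assumes the input can be iterated twice (for instance a list, or a file that is reopened), because the second function re-reads the data to check that the candidate really accounts for more than half of the values.

```python
def majority_candidate(numbers):
    """One pass, constant memory: return the only possible majority value.

    The result is meaningful only if some value makes up more than half
    of `numbers`; otherwise it is an arbitrary element (or None if empty).
    """
    candidate, count = None, 0
    for x in numbers:
        if count == 0:
            # Counter exhausted: adopt the current value as the new candidate.
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate


def verified_majority(numbers):
    """Return the majority value, or None if no value exceeds 50%.

    Assumes `numbers` can be iterated twice (e.g. a list).
    """
    candidate = majority_candidate(numbers)
    total = hits = 0
    for x in numbers:
        total += 1
        hits += (x == candidate)
    return candidate if 2 * hits > total else None
```

For example, `verified_majority([2, 1, 2, 3, 2])` confirms 2 as the majority, while `[2, 1, 3, 1]` has no value above 50% and the check rejects the candidate that the first pass produces.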

user2566092 answered Sep 23 '22 01:09
