Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding Median of more than 20 Million of 3 to 4 different integers in 1.5 seconds

I am trying to sort and find the median of a string of integers that only contains 3 to 4 different integers.

The amount of numbers I am dealing with is of magnitudes of about 20 to 25 million and I am supposed to sort the vector and find the median each time a new integer is added into the vector and add the median into a separate "Total" variable which sums up all the medians each time a median is generated.

1                   Median: 1              Total: 1
1 , 2               Median: (1+2)/2 = 1    Total: 1 + 1 = 2
1 , 2 , 3           Median: 2              Total: 2 + 2 = 4
1 , 1 , 2 , 3       Median: (1+2)/2 = 1    Total: 4 + 1 = 5
1 , 1 , 1 , 2 , 3   Median: 1              Total: 5 + 1 = 6

I am trying to find a way to optimize my code further because it is just not efficient enough. (Got to process under 2s or so) Does anyone have any idea how to further speed up my code logic?

I am currently using 2 heaps, or priority queues in C++. One functioning as a max-heap and the other functioning as a min-heap.

Gotten the idea from Data structure to find median

You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:

If the new element x is smaller than the root of Left then we insert x to 
Left.
Else we insert x to Right.
If after insertion Left has count of elements that is greater than 1 from 
the count of elements of Right, then we call Extract-Max on Left and insert 
it to Right.
Else if after insertion Right has count of elements that is greater than the 
count of elements of Left, then we call Extract-Min on Right and insert it 
to Left.
The median is always the root of Left.

So insertion is done in O(lg n) time and getting the median is done in O(1) 
time.

but it is just not fast enough...

like image 888
deviljones Avatar asked Aug 28 '18 16:08

deviljones


2 Answers

If you only ever have three to four distinct integers in the string, you can just keep track of how many times each one appears by traversing the string once. Adding (and removing elements) from this representation is also doable in constant time.

class MedianFinder
{
public:
  MedianFinder(const std::vector<int>& inputString)
  {
    for (int element : inputString)
      _counts[element]++; // Inserts 0 into map if element is not in there.
  }

  void addStringEntry(int entry)
  {
    _counts[entry]++;
  }

  int getMedian() const
  {
    size_t numberOfElements = 0;
    for (auto kvp : _counts)
      numberOfElements += kvp.second;

    size_t cumulativeCount = 0;
    int lastValueBeforeMedian;
    for (auto kvp : _counts)
    {
      cumulativeCount += kvp.second;
      if (cumulativeCount >= numberOfElements/2)
        lastValueBeforeMedian = kvp.first;
    }

    // TODO! Handle the case of the median being in between two buckets.
    //return ...
  }

private:
  std::map<int, size_t> _counts;
};

The trivial task of summing the medians is not shown here.

like image 190
Max Langhof Avatar answered Nov 14 '22 22:11

Max Langhof


I would not focus on optimizing as much as decreasing the complexity from O(n * log n) to O(n).

Your algorithm is O(n * log n) because you do n insertions each costing amortized O(log n) time.

There is a well known O(n) algorithm for median finding. I suggest using this.

Usually log n is not a big deal, but for 20 Million elements it can make your algorithm ~25 times faster.

Oh, my bad. I didn't realize there are only 3-4 different integers...

like image 43
Kostas Avatar answered Nov 14 '22 22:11

Kostas