Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate median of a Map<Int,Int>?

For a map where the key represents a number of a sequence and the value the count how often this number appeared in the squence, how would an implementation of an algorithm in java look like to calculate the median?

For example:

1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7

in a map:

Map<Int,Int> map = ...
map.put(1,2)
map.put(2,4)
map.put(3,3)
map.put(4,1)
map.put(5,1)
map.put(6,3)
map.put(7,2)

double median = calculateMedian(map);
print(median);

would result in:

> print(median);
3
>

So what i am looking for is a java implementation of calculateMedian.

like image 670
Chris Avatar asked Jun 16 '10 11:06

Chris


People also ask

How do you calculate median?

Median: The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers). Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1 , 4, 7) , the number 4 is in the middle.


2 Answers

Linear time

If you know the total of the numbers (in your case it is 16) you can go from the beginning or the end of the map and sum up the counts until you get to round(n/2)th element, or in case the sum is even to average of floor(n/2)th and ceil(n/2)th elements = median.

If you don't know the total count you will have to go through all of them at least once.

Sublinear time

If you can decide on the data structure and can do pre-processing see wikipedia on selection algorithm and you might get even sublinear algorithm. You can also get sublinear time if you know something about the distribution of the data.

EDIT: So under assumption that we have a sequence with counts what we can do is

  • while inserting the key -> count pairs maintain another map - key -> running_total
  • this way you will have a structure in which you will be able to get total_count by looking at the last key's running_total
  • and you will be able to do a binary search to locate the element where running total is close to total_count/2

This will double the memory usage, but will give O(log n) performance for median and O(1) for total_count.

like image 188
Unreason Avatar answered Oct 04 '22 04:10

Unreason


Using Guava:

Multiset<Integer> values = TreeMultiset.create();
Collections.addAll(values, 1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7);

Now the answer to your question is:

return Iterables.get(values, (values.size() - 1) / 2);

Really. That's it. (Or check if size is even and average the two central values, to be precise about it.)

If the counts are particularly large, it would be faster to use the multiset's entrySet and keep a running sum, but the simplest way is usually fine.

like image 37
Kevin Bourrillion Avatar answered Oct 04 '22 03:10

Kevin Bourrillion