For a map where the key represents a number of a sequence and the value the count how often this number appeared in the squence, how would an implementation of an algorithm in java look like to calculate the median?
For example:
1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7
in a map:
Map<Int,Int> map = ...
map.put(1,2)
map.put(2,4)
map.put(3,3)
map.put(4,1)
map.put(5,1)
map.put(6,3)
map.put(7,2)
double median = calculateMedian(map);
print(median);
would result in:
> print(median);
3
>
So what i am looking for is a java implementation of calculateMedian
.
Median: The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers). Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1 , 4, 7) , the number 4 is in the middle.
Linear time
If you know the total of the numbers (in your case it is 16) you can go from the beginning or the end of the map and sum up the counts until you get to round(n/2)th element, or in case the sum is even to average of floor(n/2)th and ceil(n/2)th elements = median.
If you don't know the total count you will have to go through all of them at least once.
Sublinear time
If you can decide on the data structure and can do pre-processing see wikipedia on selection algorithm and you might get even sublinear algorithm. You can also get sublinear time if you know something about the distribution of the data.
EDIT: So under assumption that we have a sequence with counts what we can do is
key -> count
pairs maintain another map - key -> running_total
This will double the memory usage, but will give O(log n) performance for median and O(1) for total_count.
Using Guava:
Multiset<Integer> values = TreeMultiset.create();
Collections.addAll(values, 1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7);
Now the answer to your question is:
return Iterables.get(values, (values.size() - 1) / 2);
Really. That's it. (Or check if size is even and average the two central values, to be precise about it.)
If the counts are particularly large, it would be faster to use the multiset's entrySet
and keep a running sum, but the simplest way is usually fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With