Suppose that you have a large collection of key/value pairs, where the value is some arbitrary real number. You're interested in creating a data structure supporting the following operations:

- insert(key, value): add a new key/value pair.
- delete(key): remove a key and its associated value.
- tell-percentile(key): report the percentile (0 to 99) that the value associated with the key falls at.
- find-percentile(percentile): report the key whose value is the lowest one at or above the given percentile.
This data structure could be used, for example, to efficiently determine what percentile a given student is in when receiving a stream of nationwide test scores, or to identify hospitals that have unusually good or bad quality of service.
Is there a way to make these operations run efficiently (say, in sublinear time)?
Percentiles can be calculated using the formula n = (P/100) × N, where P is the percentile, N is the number of values in the data set (sorted from smallest to largest), and n is the ordinal rank of a given value. Percentiles are frequently used to understand test scores and biometric measurements.

Explanation: your percentile is the percent of the data set that is at your score or below. There are 26 students in this particular class. Of those students, 23 were at 90 or below, so a score of 90 sits at 23/26 × 100 ≈ the 88th percentile.
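As a quick illustration of that formula, here is a minimal sketch in Java (the class and method names are ours, chosen for this example):

```java
public class PercentileFormula {
    // Ordinal rank n of the P-th percentile in a sorted data set of N values:
    // n = (P / 100) * N, rounded up to a whole position.
    static int ordinalRank(double percentile, int n) {
        return (int) Math.ceil(percentile / 100.0 * n);
    }

    // Percentile of a score: the percent of the data set at or below it.
    static double percentileOf(int countAtOrBelow, int n) {
        return 100.0 * countAtOrBelow / n;
    }

    public static void main(String[] args) {
        // 26 students, 23 at 90 or below => 90 is roughly the 88th percentile.
        System.out.println(percentileOf(23, 26)); // prints 88.46153846153847
    }
}
```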
One possible way to implement this data structure is to use a hybrid of an order statistic tree and a hash table.
An order statistic tree is a type of balanced binary search tree that, in addition to the normal binary search tree operations, supports two more operations:

- rank(v): given a value v, report its index in sorted order (equivalently, the number of elements in the tree smaller than v).
- select(i): given an index i, return the i-th smallest element stored in the tree.
Order statistic trees can be built by augmenting a normal balanced binary search tree (say, a red/black tree or an AVL tree) with extra information that is preserved during rotations. In this way, all of the normal BST operations on an order statistic tree can be made to run in O(log n) time, with the extra operations also running in O(log n) time.
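Here is a minimal sketch of that augmentation (names are ours). It shows only the size field and the rank/select logic; a real implementation would also maintain the size fields through red/black or AVL rotations:

```java
// Each node stores the size of its subtree; that count is exactly what
// makes rank and select answerable in O(height) time.
class OstNode {
    double value;
    int size = 1;          // number of nodes in this subtree, including this one
    OstNode left, right;

    OstNode(double value) { this.value = value; }

    static int size(OstNode n) { return n == null ? 0 : n.size; }

    // select(i): return the i-th smallest value (0-indexed).
    static double select(OstNode n, int i) {
        int leftSize = size(n.left);
        if (i < leftSize) return select(n.left, i);
        if (i == leftSize) return n.value;
        return select(n.right, i - leftSize - 1);
    }

    // rank(v): return the number of values strictly smaller than v.
    static int rank(OstNode n, double v) {
        if (n == null) return 0;
        if (v <= n.value) return rank(n.left, v);
        return size(n.left) + 1 + rank(n.right, v);
    }
}
```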
Now, let's suppose that you were purely storing scores, rather than key/score pairs. In this case, it would be very simple to implement the percentile lookups as follows. Store all of the values in the order statistic tree. To determine the percentile score for a given value, use the rank operation on the order statistic tree to look up the index at which that value appears. This gives a number in the range from 0 to n - 1 (where n is the number of elements in the tree) denoting the position of that score in the order statistic tree. You can then multiply that number by 99 / (n - 1) to get a percentile score for the value in the range from 0 to 99, as required.
To determine the lowest value at or above some percentile, you can use the select operation as follows. Given a percentile between 0 and 99, multiply that percentile by (n - 1) / 99 to get a real number between 0 and n - 1, inclusive. Taking the ceiling of that number produces a natural number in the range 0 to n - 1, inclusive. The select operation on the order statistic tree can then be used to find the first value in the tree that is at or above the given percentile.
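The two conversions are just inverse linear maps; a small sketch (method names are ours):

```java
// Converting between ranks (0 .. n-1) and percentiles (0 .. 99), following
// the two mappings described above.
class PercentileMath {
    // A rank in [0, n-1] maps linearly onto a percentile in [0, 99].
    static int percentileFromRank(int rank, int n) {
        return (int) Math.round(rank * 99.0 / (n - 1));
    }

    // A percentile in [0, 99] maps back onto an index in [0, n-1]; the
    // ceiling picks the first position at or above the requested percentile.
    static int rankFromPercentile(int percentile, int n) {
        return (int) Math.ceil(percentile * (n - 1) / 99.0);
    }
}
```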
However, these operations assume that we are storing pure values in the data structure, not key/value pairs. To make this work for key/value pairs, we will augment our data structure as follows:

- Add a hash table mapping each key to its associated value.
- Store, alongside each value in the order statistic tree, the key associated with it (so that a value found in the tree can be mapped back to its key).
These two changes make it possible to implement the needed functionality. To do a percentile lookup by key, we first query the hash table with the given key to look up its associated value, then do a percentile lookup on that value as before. To find a key whose value is the first at or above a given percentile, we do a normal find-percentile operation on the order statistic tree as described above, then read off the key stored alongside the resulting value.
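A sketch of the resulting hybrid (all names here are ours; it assumes an order statistic tree over (value, key) entries with the operations described above, which the node sketch earlier could be fleshed out into):

```java
import java.util.HashMap;
import java.util.Map;

// The assumed order-statistic-tree interface for this sketch.
interface OrderStatisticTree {
    void insert(double value, String key);  // O(log n)
    void delete(double value);              // O(log n)
    int rank(double value);                 // index of value in sorted order
    String selectKey(int index);            // key of the index-th smallest value
}

class PercentileMap {
    private final Map<String, Double> valueByKey = new HashMap<>();
    private final OrderStatisticTree tree;

    PercentileMap(OrderStatisticTree tree) { this.tree = tree; }

    void insert(String key, double value) {
        valueByKey.put(key, value);  // expected O(1)
        tree.insert(value, key);     // O(log n)
    }

    void delete(String key) {
        Double value = valueByKey.remove(key);  // expected O(1)
        if (value != null) tree.delete(value);  // O(log n)
    }

    // Percentile (0..99) of the value stored under key.
    int tellPercentile(String key) {
        int n = valueByKey.size();
        return (int) Math.round(tree.rank(valueByKey.get(key)) * 99.0 / (n - 1));
    }

    // Key whose value is the first at or above the given percentile.
    String keyAtPercentile(int percentile) {
        int n = valueByKey.size();
        return tree.selectKey((int) Math.ceil(percentile * (n - 1) / 99.0));
    }
}
```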
If we assume that the hash table uses chained hashing, then the time required for each operation is given below:

- insert: expected O(log n) (an expected O(1) hash table insertion plus an O(log n) tree insertion).
- delete: expected O(log n), for the same reason.
- tell-percentile: expected O(log n) (an expected O(1) hash lookup plus an O(log n) rank query).
- find-percentile: O(log n) (a select query plus reading off the stored key).
Hope this helps!
There is a simple and highly efficient possibility:
If you can live with computing percentiles only once the student data structure has been completely filled in, then:

Use a dynamic array (e.g., ArrayList in Java) to build up the data when you don't know the number of elements in advance. If you do know it, start with a plain array directly; otherwise, create the array from the dynamic one once all elements have been added.
- insert: not necessary; replaced by appending at the end and sorting once.
- delete: not necessary, if you can live with that.
- tell-percentile: even simpler, something very close to element[length * percentile]: O(1) (see the sketch after this list).
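A minimal sketch of this approach (class and method names are ours): build the data up once, sort once, then answer percentile queries by plain indexing.

```java
import java.util.Arrays;
import java.util.List;

class PercentileArray {
    private final double[] scores;

    // Build from a dynamically grown list when the element count is unknown.
    PercentileArray(List<Double> values) {
        scores = values.stream().mapToDouble(Double::doubleValue).toArray();
        Arrays.sort(scores);  // sort once, after all data has been read
    }

    // Lowest score at or above the given percentile (0..99): O(1).
    double tellPercentile(int percentile) {
        int index = (int) Math.ceil(percentile * (scores.length - 1) / 99.0);
        return scores[index];
    }
}
```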
In practice the array approach will be much faster than the balanced-tree approach, at least in Java, whenever your application can build up the array once (e.g., a daily student evaluation: build it up once per day).
I have implemented the algorithm above using a self-written ArrayListInt, which does the same as ArrayList but uses primitive types (double, int) instead of object types. I sorted it once, after all the data had been read.
Further, you wanted key/value lookups:

I would just add a TreeMap (balanced tree). Whether the TreeMap plus the additional percentile array makes sense is a bit doubtful: that depends on how often you have to search, and on memory usage versus search time.
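One way such a combination could look (names are ours): the TreeMap answers key-to-score lookups, and a binary search on the sorted array turns a score into a rank, and hence a percentile.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class StudentPercentiles {
    private final Map<String, Double> scoreByKey = new TreeMap<>();
    private final double[] sorted;

    StudentPercentiles(Map<String, Double> scores) {
        scoreByKey.putAll(scores);
        sorted = scores.values().stream().mapToDouble(Double::doubleValue).toArray();
        Arrays.sort(sorted);
    }

    // Percentile (0..99) of the score stored under key: O(log n) overall.
    int percentileOf(String key) {
        int rank = Arrays.binarySearch(sorted, scoreByKey.get(key));
        if (rank < 0) rank = -rank - 1;  // defensive: insertion point if absent
        return (int) Math.round(rank * 99.0 / (sorted.length - 1));
    }
}
```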
Update:
Results: TreeSet vs. sorted array (build up the array dynamically, then sort once at the end):

num elements:     1000   treeSet:     4.55989   array:    0.564159
num elements:    10000   treeSet:     2.662496  array:    1.157591
num elements:   100000   treeSet:    31.642027  array:   12.224639
num elements:  1000000   treeSet:  1319.283703  array:  140.293312
num elements: 10000000   treeSet: 21212.307545  array: 3222.844045
This amount of elements (1e7) is now near the limit (1 GB heap space); at the next step, memory would run out (it already happened at 1e7, but after cleaning up memory from the TreeSet run, the 1e7 measurement worked, too).
What is missing are the search times, but a sorted array with binary search is only beatable by a hash table.
Finally: if you can build up the student set once, e.g. daily, then the array approach gives a much simpler percentile search.