
Turning a bunch of numeric attributes into a single score

This comes up a lot, and it's surprising there doesn't seem to be a standard solution. Say I have several numeric attributes (imagine ranking colleges or cities based on component scores like student/teacher ratio or pollution) and want to turn them into a single score.

I'd like to take a bunch of examples and interpolate to get a consistent scoring function.

Maybe there are standard multidimensional curve-fitting or data-smoothing libraries or something that makes this straightforward?

More examples:

  • Turning the two blood pressure numbers into a single score for how close to optimal your blood pressure is
  • Turning body measurements into a single measure of how far you are from your ideal physique
  • Turning a set of times (100-meter dash, etc.) into a fitness score for a certain sport
dreeves asked Dec 30 '14 20:12


1 Answer

tl;dr: Check out HiScore. It will allow you to quickly write and maintain scoring functions that behave in sensible ways.

To instantiate your simple example, let's say you have an app that receives as input a set of distances and times, and you want to map them to a 1-100 score. For instance, you get (1.2 miles, 8:37) and you'd like to return, say, 64.

The typical approach is to pick several basis functions and then futz around with the coefficients of those basis functions to get scores that "look right". For instance, you may have a linear basis function on minutes-per-mile, with additional basis functions for distance (maybe one linear in distance and one linear in the square root of distance). You could even use, e.g., radial basis functions for more expressiveness across your range of inputs. (This is very similar to what other answers have suggested in terms of ML algorithms like SVMs and the like.)
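As a rough illustration of the basis-function approach just described, here is a minimal sketch. The function name, the choice of basis terms, and every coefficient are hypothetical; they stand in for the values you would tune by hand.

```python
import math

# Hypothetical coefficients, hand-tuned until the scores "look right".
WEIGHTS = {
    "bias": 120.0,
    "pace": -9.0,           # minutes per mile: a slower pace lowers the score
    "distance": 4.0,        # linear in miles
    "sqrt_distance": 6.0,   # linear in the square root of miles
}

def basis_score(distance_miles, time_minutes):
    """Weighted sum of hand-picked basis functions, clamped to the 1-100 range."""
    pace = time_minutes / distance_miles
    raw = (WEIGHTS["bias"]
           + WEIGHTS["pace"] * pace
           + WEIGHTS["distance"] * distance_miles
           + WEIGHTS["sqrt_distance"] * math.sqrt(distance_miles))
    return max(1.0, min(100.0, raw))

# Score the (1.2 miles, 8:37) example from the question.
print(basis_score(1.2, 8 + 37 / 60))
```

Changing what the score "should" be for one input means editing these weights, which shifts the score of every other input too.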

This approach is typically pretty fast, but there are many downsides. First, you have to get the basis functions right, which can be hard for more abstract and expressive functions. Second, you'll find that your score will ossify quickly: if you find an input that you feel is mis-scored, figuring out how to change it while making sure the rest of the scoring function "looks right" will be a challenge. Third, adding another attribute to the score (e.g., whether the runner is male or female) can be difficult, as you may find that you'll need to add many more terms to your basis. Finally, there's no explicit guarantee in this approach that your score will behave intelligently: depending on the basis functions and coefficients you select, someone running a mile in 7:03 could get a higher score than someone running 1.1 miles in 7:01.
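One way to catch that last problem is to scan a set of inputs for pairs where the clearly better run gets the lower score. The sketch below does this for (distance, time) pairs; `bad_score` is a deliberately mis-tuned, hypothetical scoring function used only to trigger the check.

```python
from itertools import permutations

def dominates(a, b):
    """True if run a is at least as far and at least as fast as run b, and differs from it."""
    (dist_a, time_a), (dist_b, time_b) = a, b
    return dist_a >= dist_b and time_a <= time_b and a != b

def find_violations(score, runs):
    """Return (better, worse) pairs where the better run gets a strictly lower score."""
    return [(a, b) for a, b in permutations(runs, 2)
            if dominates(a, b) and score(*a) < score(*b)]

def bad_score(distance_miles, time_minutes):
    """Hypothetical score whose distance coefficient has the wrong sign."""
    pace = time_minutes / distance_miles
    return 250.0 - 10.0 * pace - 80.0 * distance_miles

# (miles, minutes): 1 mile in 7:03, 1.1 miles in 7:01, 1.2 miles in 8:37.
runs = [(1.0, 7 + 3 / 60), (1.1, 7 + 1 / 60), (1.2, 8 + 37 / 60)]
violations = find_violations(bad_score, runs)
print(violations)
```

A check like this only reports violations among the inputs you feed it; it does not prove the function is well behaved everywhere.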

A different approach exists in the form of HiScore, a Python library I wrote when faced with a similar problem. With HiScore, you label a reference set of items with scores and then it generates a scoring function that intelligently interpolates through those scores. For instance, you could take the last 100 inputs to your app, combine them with a handful of your most extreme inputs (perhaps take the convex hull of your submitted inputs in (distance, time) space), label them, and use HiScore to produce a reasonable scoring function. And if it ever comes up with a score that you disagree with, just add that input to the reference set with the correct label and re-create the scoring function, because HiScore guarantees interpolation through the reference set.

One property of HiScore is that the score must be monotone in each attribute: always increasing or always decreasing as that attribute grows. This is not a problem for the "running times" setting, because the score should go up as distance increases (for a fixed time) and down as time increases (for a fixed distance). HiScore's monotonicity gives you confidence your score will behave as expected; it guarantees someone running a mile in 7:03 will score no higher than someone running 1.1 miles in 7:01.
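To make the idea of a monotone function that passes through labeled reference points concrete, here is a minimal sketch. It is not HiScore's algorithm or API: it is a simple piecewise-constant construction that averages a lower and an upper bound implied by the reference set, and it assumes the labels are themselves consistent with monotonicity. The name `make_monotone_score` and the reference labels are hypothetical.

```python
def make_monotone_score(reference, directions, minval=1.0, maxval=100.0):
    """Build a score that is monotone and passes through every reference point.

    reference:  dict mapping attribute tuples to labeled scores
    directions: +1 if the score rises with that attribute, -1 if it falls
    """
    def at_least_as_good(x, y):
        return all(d * (xi - yi) >= 0 for d, xi, yi in zip(directions, x, y))

    def score(point):
        # Highest label among reference points that `point` matches or beats.
        lower = max((s for p, s in reference.items() if at_least_as_good(point, p)),
                    default=minval)
        # Lowest label among reference points that match or beat `point`.
        upper = min((s for p, s in reference.items() if at_least_as_good(p, point)),
                    default=maxval)
        return (lower + upper) / 2.0

    return score

# Hypothetical labels for (miles, minutes): more distance is better, more time is worse.
reference = {
    (1.0, 10.0): 40.0,
    (1.0, 7.0): 70.0,
    (2.0, 14.0): 75.0,
    (2.0, 20.0): 45.0,
}
score = make_monotone_score(reference, directions=(+1, -1))

print(score((1.0, 7 + 3 / 60)), score((1.1, 7 + 1 / 60)))
```

Because both bounds can only rise as an input improves, the 1.1-mile run in 7:01 can never score below the 1-mile run in 7:03. Fixing a disputed score amounts to adding one more entry to `reference` and rebuilding the function.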

The blood pressure setting you bring up is interesting because it's not monotone. Low blood pressure is bad, but high blood pressure is bad too. You can still use HiScore here though: just split each measurement into a "high blood pressure" and a "low blood pressure" component, where at least one of these is zero. For instance, taking 100 as the optimal systolic value, a systolic reading of 160 would be mapped into a systolic+ attribute of 60 and a systolic- attribute of 0. The score should be decreasing in both of these new attributes, and so this approach turns a non-monotone two-dimensional problem (with attributes systolic and diastolic) into a monotone four-dimensional one (with attributes systolic+, systolic-, diastolic+, diastolic-). (This trick is similar to one that helps get Linear Programs into canonical form.)
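The split can be sketched in a few lines. The systolic reference of 100 is inferred from the 160-to-60 example above; the diastolic reference of 70 and both function names are hypothetical placeholders.

```python
def split_around(value, reference):
    """Split a reading into (excess above reference, shortfall below reference)."""
    return max(value - reference, 0), max(reference - value, 0)

def blood_pressure_attributes(systolic, diastolic, systolic_ref=100, diastolic_ref=70):
    """Map (systolic, diastolic) to (systolic+, systolic-, diastolic+, diastolic-).

    The score should decrease in each of the four returned attributes.
    """
    return (*split_around(systolic, systolic_ref),
            *split_around(diastolic, diastolic_ref))

print(blood_pressure_attributes(160, 70))  # (60, 0, 0, 0)
print(blood_pressure_attributes(90, 85))   # (0, 10, 15, 0)
```

At most one attribute in each +/- pair is nonzero, so a monotone scorer can penalize deviation in either direction independently.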

aothman answered Nov 08 '22 22:11