
Which keywords most distinguish two groups of people?

I have a database of keywords used in searches by people of different groups. Something like:

group1person1: x, y, z
group1person2: x, z, d
...
group2person1: z, d, l
...

and so on

I want to see which keywords are most characteristic of a given group. I'm trying to do what OkCupid did in their blog: http://blog.okcupid.com/index.php/the-real-stuff-white-people-like/

Can anyone recommend suitable algorithms / terminology / advice regarding this task?

(I'll be doing this in Python)

Thanks in advance!

asked Aug 27 '11 by DrMisha

1 Answer

Your question more or less describes the core use case for the ID3 algorithm.

The output of ID3 is a classifier with a tree structure (ID3, C4.5, et al. are often referred to as decision trees). The Wikipedia entry for Decision Tree Learning has a decent summary, at the algorithm level, of ID3.

The usual metric within ID3 that determines how the data at a given node ought to be split is Information Entropy. (A less commonly used alternative is Gini Impurity.) The ID3 algorithm itself is just a recursive partitioner: at each node it tests every variable/value combination and splits on the combination that gives the lowest weighted average Entropy.
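If it helps to see the shape of that step in code, here's a rough sketch. It is not ID3 verbatim: real ID3 does a multi-way split per categorical variable, while this uses a simple equals/not-equals split; entropy is the helper function defined further down (it accepts a list or array of labels), and choose_split is just an illustrative name:

def choose_split(rows, labels):
    # rows: list of feature lists; labels: parallel list of class labels.
    # Greedily pick the (column, value) pair whose induced partition has the
    # lowest weighted average entropy of the class labels (the ID3 criterion).
    best = None
    n = float(len(labels))
    for col in range(len(rows[0])):
        for val in set(row[col] for row in rows):
            match = [lab for row, lab in zip(rows, labels) if row[col] == val]
            rest  = [lab for row, lab in zip(rows, labels) if row[col] != val]
            if not match or not rest:
                continue    # this value doesn't actually split the node
            score = (len(match) / n) * entropy(match) + (len(rest) / n) * entropy(rest)
            if best is None or score < best[0]:
                best = (score, col, val)
    return best    # (weighted average entropy, column index, split value)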

Intuitively, Information Entropy identifies the variable (column), and the value within that variable, that splits the data "best", where "best" agrees with our intuition about separating the class labels. This is much easier to show than to describe in prose. Consider this data set:

Height      Weight      Age     90 min aerobics/wk?     completed 5 mile run?
 155         45          31           Yes                      True
 160         51          33           No                       False
 168         52          28           No                       False
 155         61          25           Yes                      True
 169         57          52           Yes                      True
 172         81          35           No                       False
 164         70          23           Yes                      False

If the data is split on column 4 (does the person engage in aerobic exercise for at least 90 minutes each week?) then the resulting two groups of class labels look like this:

Yes Group: [True, True, True, False]

No Group: [False, False, False]

Almost, but not quite, perfectly homogeneous class labels within each group. So obviously column 4 is the 'best' variable on which to split this data.

The metric used in the ID3 algorithm to determine the best split is just a mathematical formalism of this intuition.
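To put numbers on it (the entropy formula itself is given further down; these are just that formula applied to the two lists above):

entropy(Yes group) = -( 3/4 * log2(3/4) + 1/4 * log2(1/4) ) ≈ 0.811
entropy(No group)  = -( 3/3 * log2(3/3) )                    = 0.0
weighted average   = (4/7) * 0.811 + (3/7) * 0.0             ≈ 0.46

The purer the two resulting groups, the lower this number, so the split on column 4 scores very well.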

This is not a perfect (mathematically precise) analogy, but roughly you can think of Information Entropy as playing the role for categorical variables (discrete values) that Variance plays for continuous variables (floats). In other words, Information Entropy (roughly) expresses the variance (or standard deviation) of discrete data.

Here's a python function to calculate entropy (using NumPy):

def entropy(arr1):
    # Shannon entropy of the labels in arr1: -sum over i of p(i) * log2(p(i))
    import numpy as NP
    arr1 = NP.asarray(arr1)
    ue = NP.unique(arr1)
    ent = 0.
    for itm in ue:
        p = NP.size(arr1[arr1 == itm]) / float(arr1.size)   # frequency of this label
        ent -= p * NP.log2(p)
    return ent

The entropy function above is just these two expressions combined and reduced to code:

p(i) = frequency(outcome i) = count(outcome i) / count(total rows)

entropy = -1 * sum over i of p(i) * log2(p(i))

Perfect homogeneity (all labels the same) has entropy = 0, so the most "distinguishing" variable/value is the one such that when you split the data on that variable and value, the weighted average entropy is lowest. Entropy is at its maximum, log2 of the number of classes (1.0 for two classes), when the labels are completely 'mixed', i.e., near random.

# simulate a data set with three class labels (0, 1, 2)
# for your problem, the class labels are the keywords,
# so just map each unique keyword to an integer value (e.g., { 'keyword1' : 0, 'keyword2' : 1 })
>>> import numpy as NP
>>> x = NP.random.randint(0, 3, 20)
>>> x
   array([1, 0, 0, 0, 1, 1, 2, 1, 1, 1, 2, 2, 0, 2, 0, 1, 1, 1, 1, 1])

>>> print("{0:.3f}".format(entropy(x)))
   1.439
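For comparison, the two extremes for a two-label list:

>>> print("{0:.3f}".format(entropy(NP.array([0, 0, 0, 0]))))   # perfectly homogeneous
   0.000
>>> print("{0:.3f}".format(entropy(NP.array([0, 1, 0, 1]))))   # completely mixed
   1.000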

In sum, for your particular problem: to determine the most "distinguishing" keyword, split the data on a candidate keyword, calculate the Entropy for each of the two resulting class label lists, then calculate their weighted average (weighted by the number of items in each list). The keyword whose split gives the lowest weighted average Entropy is the one you are after.
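One way to read that recipe, consistent with the ID3 split criterion described above: for each candidate keyword, partition the people into those who used it and those who didn't, and score the keyword by the weighted average Entropy of the group labels in the two partitions. Here is a minimal sketch; the data layout (a dict mapping each person to a (group_label, set_of_keywords) pair) and the names person_data / best_keyword are just assumptions for illustration, and entropy is the function defined above:

def best_keyword(person_data):
    # person_data: { person_id : (group_label, set_of_keywords) }
    all_keywords = set()
    for grp, kws in person_data.values():
        all_keywords.update(kws)
    scores = {}
    for kw in all_keywords:
        used     = [grp for grp, kws in person_data.values() if kw in kws]
        not_used = [grp for grp, kws in person_data.values() if kw not in kws]
        n = float(len(used) + len(not_used))
        # weighted average entropy of the group labels in the two partitions
        scores[kw] = (len(used) / n) * entropy(used) + (len(not_used) / n) * entropy(not_used)
    # lowest weighted average entropy = most distinguishing keyword
    return min(scores, key=scores.get)

With just the three example rows in your question, both 'x' (used only by group1 people) and 'l' (used only by the group2 person) get a weighted average Entropy of 0, i.e., each splits those people perfectly by group.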

answered Sep 18 '22 by doug