
Affinity Propagation Clustering for Addresses

I have a list of addresses for many people (1-8 addresses each) and I'm trying to identify the number of unique addresses each person has.

Here is a sample address dataset for one person:

#df[df['ID'] == '12345'][['address', 'zip']].values
addresses = [['PULMONARY MED ASSOC MED GROUP INC 1485 RIVER PARK DR STE 200',
        '95815'],
       ['1485 RIVER PARK DRIVE SUITE 200', '95815'],
       ['1485 RIVER PARK DR SUITE 200', '95815'],
       ['3637 MISSION AVE SUITE 7', '95608']]

I've got an address parser that separates out the different portions of the address (the "attn", house number, street name, PO box, etc.) so that I can compare them individually (code found here).
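For reference, the shape of the parser output that the calcDistance function below expects can be illustrated with a hypothetical dict (the field names are inferred from how the code uses them; the actual linked parser may differ):

```python
# Hypothetical parser output for the first sample address; the real
# parser's field names are inferred from how calcDistance uses them.
parsed = {
    'attn': 'PULMONARY MED ASSOC MED GROUP INC',
    'house': '1485',
    'street_name': 'RIVER PARK',
    'suite_num': '200',
}
print(sorted(parsed))
```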

As you can see from the data above, addresses 1-3 are probably the same, and address 4 is different.

I wrote the following distance calculation method. There's no magic to the weights, just what my intuition said should be most important:

# Weighted average of per-component Levenshtein distances.
# distance.levenshtein comes from the third-party "distance" package.
def calcDistance(a1, a2, z1, z2, parser):

    z1 = str(z1)
    z2 = str(z2)
    add1 = parser.parse(a1)
    add2 = parser.parse(a2)

    zip_dist = 0 if z1 == z2 else distance.levenshtein(z1, z2)
    zip_weight = .4

    # Each remaining component only contributes (to both the numerator
    # and the weight total) when both addresses have it parsed.
    attn_dist = distance.levenshtein(add1['attn'], add2['attn']) if add1['attn'] and add2['attn'] else 0
    attn_weight = .1 if add1['attn'] and add2['attn'] else 0

    suite_dist = distance.levenshtein(add1['suite_num'], add2['suite_num']) if add1['suite_num'] and add2['suite_num'] else 0
    suite_weight = .1 if add1['suite_num'] and add2['suite_num'] else 0

    street_dist = distance.levenshtein(add1['street_name'], add2['street_name']) if add1['street_name'] and add2['street_name'] else 0
    street_weight = .3 if add1['street_name'] and add2['street_name'] else 0

    house_dist = distance.levenshtein(add1['house'], add2['house']) if add1['house'] and add2['house'] else 0
    house_weight = .1 if add1['house'] and add2['house'] else 0

    dist = (zip_dist * zip_weight + attn_dist * attn_weight + suite_dist * suite_weight
            + street_dist * street_weight + house_dist * house_weight) / (
            zip_weight + attn_weight + suite_weight + street_weight + house_weight)

    return dist
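The distance.levenshtein calls above come from the third-party "distance" package (pip install distance). For readers without it, a minimal pure-Python edit distance that behaves the same way might look like:

```python
# Minimal Levenshtein (edit) distance: a stand-in for
# distance.levenshtein from the third-party "distance" package.
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))  # DP row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('DR', 'DRIVE'))    # 3
print(levenshtein('95815', '95608')) # 3
```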

Applying this function to each pair of my addresses, you can see that addresses 1-3 are correctly identified as identical (distance 0), and address 4 is a bit different.

similarity = -1*np.array([[calcDistance(a1[0],a2[0],a1[1],a2[1],addr_parser) for a1 in addresses] for a2 in addresses])

print similarity 

array([[-0.        , -0.        , -0.        , -5.11111111],
       [-0.        , -0.        , -0.        , -5.11111111],
       [-0.        , -0.        , -0.        , -5.11111111],
       [-5.11111111, -5.11111111, -5.11111111, -0.        ]])
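For clarity, the matrix above is just a negated pairwise distance matrix. The same construction, with a hypothetical toy dist() standing in for calcDistance and the parser machinery:

```python
# Toy stand-in for calcDistance, purely to illustrate the matrix build.
def dist(a, b):
    return 0.0 if a == b else 1.0  # hypothetical distance

items = ['x', 'x', 'y']
# Affinity propagation expects similarities (higher = more alike),
# hence each distance is negated.
similarity = [[-dist(a, b) for a in items] for b in items]
print(similarity)
```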

To then cluster these, I thought affinity propagation might be the best way to go: the cluster count is variable, it works with precomputed similarities, and it identifies a prototypical exemplar, which I could then use as the "best" address to represent the cluster. However, I'm getting some strange results: the AffinityPropagation clusterer is producing 3 clusters for this data instead of 2.

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(similarity)

print affprop.labels_
array([0, 0, 1, 2], dtype=int64)

Conversely, DBSCAN correctly clusters into two:

dbscan = sklearn.cluster.DBSCAN(min_samples=1)
dbscan.fit(similarity)

print dbscan.labels_
array([0, 0, 0, 1], dtype=int64)
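(A side note on the DBSCAN call above: without metric='precomputed', fit(similarity) treats each row of the matrix as a feature vector rather than as pairwise distances; it happens to give the right answer here because identical rows coincide. A sketch of the precomputed route, with eps=1.0 as an assumed threshold, would be:)

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Pairwise similarities from above (negated distances).
similarity = np.array([[-0.        , -0.        , -0.        , -5.11111111],
                       [-0.        , -0.        , -0.        , -5.11111111],
                       [-0.        , -0.        , -0.        , -5.11111111],
                       [-5.11111111, -5.11111111, -5.11111111, -0.        ]])

# metric='precomputed' tells DBSCAN the input is already a distance
# matrix; eps=1.0 is an assumed "same address" threshold.
dbscan = DBSCAN(eps=1.0, min_samples=1, metric='precomputed')
dbscan.fit(-similarity)  # negate back to nonnegative distances
print(dbscan.labels_)
```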

Looking at this question, it seems the issue is that the clusterer adds small random starting values to the similarities, and treats the perfectly similar records as degeneracies.

Is there a way around this or should I just give up with the affinity clustering and stick to DBSCAN?

Asked Feb 02 '17 16:02 by flyingmeatball



1 Answer

While I would suspect that this issue would go away with larger samples of different groups (see example below), in your case it looks like you'll want to increase the damping factor to get the desired result. Starting at .95, you get the correct grouping:

>>> affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.95)
>>> affprop.fit(similarity)
AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
          damping=0.95, max_iter=200, preference=None, verbose=False)
>>> print affprop.labels_
[0 0 0 1]

As I mentioned initially, this problem seems to go away as you add more different data to your set. Looking at the example in the question you referenced, we see they initially have the same issue:

>>> c = [[0], [0], [0], [0], [0], [0], [0], [0]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean').fit(c)
>>> print(af.labels_)
[0 1 0 1 2 1 1 0]

This goes away with higher damping:

>>> c = [[0], [0], [0], [0], [0], [0], [0], [0]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean', damping=.99).fit(c)
>>> print(af.labels_)
[0 0 0 0 0 0 0 0]

Or as we introduce more groups:

>>> c = [[0], [0], [0], [1], [2], [1], [2], [1]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean', damping=.5).fit(c)
>>> print(af.labels_)
[0 0 0 2 1 2 1 2]
Answered Oct 19 '22 23:10 by Tchotchke