I have a list of addresses for many people (1-8 addresses each) and I'm trying to identify the number of unique addresses each person has.
Here is a sample address dataset for one person:
# df[df['ID'] == '12345'][['address', 'zip']].values
addresses = [['PULMONARY MED ASSOC MED GROUP INC 1485 RIVER PARK DR STE 200',
'95815'],
['1485 RIVER PARK DRIVE SUITE 200', '95815'],
['1485 RIVER PARK DR SUITE 200', '95815'],
['3637 MISSION AVE SUITE 7', '95608']]
I've got an address parser that separates the different portions of the address (the "attn", house number, street name, PO Box, etc.) so that I can compare them individually (code found here).
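The parser itself is only linked, not shown. For a self-contained illustration, a toy regex-based stand-in that returns the four fields the scoring code below relies on ('attn', 'house', 'street_name', 'suite_num') might look like this (a minimal sketch, far less robust than the linked parser):

```python
import re

def parse(address):
    # Toy stand-in for the linked parser: extracts only the four fields the
    # scoring function uses. Purely illustrative, not a real address parser.
    addr = address.upper()
    suite = None
    m = re.search(r'\b(?:STE|SUITE)\s+(\w+)', addr)
    if m:
        suite = m.group(1)
        addr = addr[:m.start()].strip()
    house = street = attn = None
    m = re.search(r'\b(\d+)\s+([A-Z ]+?)\s*$', addr)
    if m:
        house = m.group(1)
        street = m.group(2)
        attn = addr[:m.start()].strip() or None
    return {'attn': attn, 'house': house,
            'street_name': street, 'suite_num': suite}
```

For example, parse('1485 RIVER PARK DR SUITE 200') yields house '1485', street_name 'RIVER PARK DR', suite_num '200', and no attn.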
As you can see from the data above, addresses 1-3 are probably the same, and address 4 is different.
I wrote the following similarity calculation method. There's no magic to the weights, just what my intuition said should be most important:
def calcDistance(a1, a2, z1, z2, parser):
    z1 = str(z1)
    z2 = str(z2)
    add1 = parser.parse(a1)
    add2 = parser.parse(a2)

    zip_dist = 0 if z1 == z2 else distance.levenshtein(z1, z2)
    zip_weight = .4

    attn_dist = distance.levenshtein(add1['attn'], add2['attn']) if add1['attn'] and add2['attn'] else 0
    attn_weight = .1 if add1['attn'] and add2['attn'] else 0

    suite_dist = distance.levenshtein(add1['suite_num'], add2['suite_num']) if add1['suite_num'] and add2['suite_num'] else 0
    suite_weight = .1 if add1['suite_num'] and add2['suite_num'] else 0

    street_dist = distance.levenshtein(add1['street_name'], add2['street_name']) if add1['street_name'] and add2['street_name'] else 0
    street_weight = .3 if add1['street_name'] and add2['street_name'] else 0

    house_dist = distance.levenshtein(add1['house'], add2['house']) if add1['house'] and add2['house'] else 0
    house_weight = .1 if add1['house'] and add2['house'] else 0

    # weighted average of the per-component edit distances
    dist = (zip_dist * zip_weight + attn_dist * attn_weight + suite_dist * suite_weight
            + street_dist * street_weight + house_dist * house_weight) / (
                zip_weight + attn_weight + suite_weight + street_weight + house_weight)
    return dist
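The distance.levenshtein calls come from the third-party distance package; if it is unavailable, an equivalent pure-Python edit distance is short enough to inline (a standard dynamic-programming sketch, not the package's implementation):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance, equivalent in result to
    # distance.levenshtein from the third-party 'distance' package.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]
```

For instance, levenshtein('95815', '95608') is 3, matching the zip comparison used above.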
Applying this function to each pair of addresses, you can see that addresses 1-3 are correctly scored as completely similar, while address 4 is somewhat different.
similarity = -1 * np.array([[calcDistance(a1[0], a2[0], a1[1], a2[1], addr_parser)
                             for a1 in addresses] for a2 in addresses])
print(similarity)
array([[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-5.11111111, -5.11111111, -5.11111111, -0. ]])
To then cluster these, I thought affinity propagation might be the best way to go: the cluster count is variable, it works with distances, and it can identify a prototypical example for each cluster, which I could then use as the "best" address to represent the cluster. However, I'm getting some strange results: the AffinityPropagation clusterer produces 3 clusters for this data instead of 2.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(similarity)
print(affprop.labels_)
array([0, 0, 1, 2], dtype=int64)
Conversely, DBSCAN correctly clusters into two groups:
dbscan = sklearn.cluster.DBSCAN(min_samples=1)
dbscan.fit(similarity)
print(dbscan.labels_)
array([0, 0, 0, 1], dtype=int64)
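One detail worth flagging (not from the original post): without metric='precomputed', DBSCAN treats each row of the matrix as a 4-dimensional feature vector rather than as pairwise distances. Here that happens to produce the same grouping, but being explicit is safer, and the precomputed variant expects nonnegative distances rather than negated similarities. A sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The negated similarity matrix from above, i.e. nonnegative pairwise distances
distances = np.array([[0., 0., 0., 5.11111111],
                      [0., 0., 0., 5.11111111],
                      [0., 0., 0., 5.11111111],
                      [5.11111111, 5.11111111, 5.11111111, 0.]])

# metric='precomputed' tells DBSCAN the input already holds pairwise distances
db = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit(distances)
print(db.labels_)  # [0 0 0 1]
```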
Looking at this question, it seems that the issue is that the clusterer adds small random starting values to break ties, and counts the perfectly similar records as degeneracies.
Is there a way around this, or should I just give up on affinity propagation and stick with DBSCAN?
High values of the preferences will cause affinity propagation to find many exemplars (clusters), while low values will lead to a small number of exemplars (clusters). A good initial choice for the preference is the minimum similarity or the median similarity.
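In other words, the preference is a direct knob on cluster count: pinning it to the minimum similarity discourages each point from becoming its own exemplar. A hypothetical sketch against the question's matrix (not from the original answer; whether it fully resolves the degenerate ties can depend on the scikit-learn version and its random tie-breaking):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

similarity = -np.array([[0., 0., 0., 5.11111111],
                        [0., 0., 0., 5.11111111],
                        [0., 0., 0., 5.11111111],
                        [5.11111111, 5.11111111, 5.11111111, 0.]])

# preference defaults to the median similarity; forcing it down to the
# minimum makes each point less eager to serve as its own exemplar
affprop = AffinityPropagation(affinity='precomputed',
                              preference=similarity.min(),
                              damping=0.9)
affprop.fit(similarity)
print(affprop.labels_)
```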
While I would suspect that this issue would go away with larger samples of different groups (see the examples below), in your case it looks like you'll want to increase the damping factor to get the desired result. Starting at .95 you get the correct grouping:
>>> affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.95)
>>> affprop.fit(similarity)
AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
damping=0.95, max_iter=200, preference=None, verbose=False)
>>> print(affprop.labels_)
[0 0 0 1]
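Since part of the goal was a prototypical address per cluster: a fitted AffinityPropagation exposes cluster_centers_indices_, one exemplar index per cluster, which can be mapped back to the original addresses. A sketch (assuming the damping=.95 fit converges to the grouping shown above):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

addresses = [['PULMONARY MED ASSOC MED GROUP INC 1485 RIVER PARK DR STE 200', '95815'],
             ['1485 RIVER PARK DRIVE SUITE 200', '95815'],
             ['1485 RIVER PARK DR SUITE 200', '95815'],
             ['3637 MISSION AVE SUITE 7', '95608']]

similarity = -np.array([[0., 0., 0., 5.11111111],
                        [0., 0., 0., 5.11111111],
                        [0., 0., 0., 5.11111111],
                        [5.11111111, 5.11111111, 5.11111111, 0.]])

affprop = AffinityPropagation(affinity='precomputed', damping=0.95).fit(similarity)

# cluster_centers_indices_ holds one exemplar index per cluster; map each back
# to its original address string to get the cluster's representative
for label, idx in enumerate(affprop.cluster_centers_indices_):
    print(label, addresses[idx][0])
```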
As I mentioned initially, this problem seems to go away as you add more varied data to your set. Looking at the example in the question you referenced, we can see they initially have the same issue:
>>> c = [[0], [0], [0], [0], [0], [0], [0], [0]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean').fit(c)
>>> print(af.labels_)
[0 1 0 1 2 1 1 0]
This goes away with higher damping:
>>> c = [[0], [0], [0], [0], [0], [0], [0], [0]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean', damping=.99).fit(c)
>>> print(af.labels_)
[0 0 0 0 0 0 0 0]
Or as we introduce more groups:
>>> c = [[0], [0], [0], [1], [2], [1], [2], [1]]
>>> af = sklearn.cluster.AffinityPropagation(affinity='euclidean', damping=.5).fit(c)
>>> print(af.labels_)
[0 0 0 2 1 2 1 2]