How to merge similar items in a list

Question

I haven't found anything relevant on Google, so I'm hoping to find some help here :)

I've got a Python list as follows:

[['hoose', 200], ["Bananphone", 10], ['House', 200], ["Bonerphone", 10], ['UniqueValue', 777] ...]

I have a function that returns the Levenshtein distance between 2 strings, for House and hoose it would return 2, etc.

Now I want to merge list elements that have a levenshtein score of f.e. <5, while (!) adding their scores, so for the resulting list I want the following:

[['hoose', 400], ["Bananaphone", 20], ['UniqueValue', 777], ...]

or

[['House', 400], ["Bonerphone", 20], ['UniqueValue', 777], ...]

etc.

It doesn't matter as long as their values get added.

There will only ever be 2 items in the list that are very similar, so a chain effect of any one item similar to a lot of others eating them all up isn't expected.

Björn Pollex · Accepted Answer

To bring home the point from my comment, I just grabbed an implementation of that distance from here, and calculated some distances:

d('House', 'hoose') = 2
d('House', 'trousers') = 4
d('trousers', 'hoose') = 5

Now, suppose your threshold is 4. You would have to merge House and hoose, as well as House and trousers, but not trousers and hoose. Are you sure something like this can never happen with your data?

In the end, I think is more of a clustering problem, so you probably have to look into clustering algorithms. SciPy offers an implementation of hierarchical clustering that works with custom distance functions (be aware that this can be very slow for larger data sets - it also consumes a lot of memory).

The main problem is to decide on a measure for cluster quality, because there is not one correct solution for your problem. This paper(pdf) gives you a starting point, to understand that problem.

How to merge similar items in a list

Tags:

python

arrays

string

algorithm

list

Kami

1 Answers

Björn Pollex

Recent Activity

Donate For Us

How to merge similar items in a list

Tags:

python

arrays

string

algorithm

list

Kami

1 Answers

Björn Pollex

Related questions

Recent Activity

Donate For Us