Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging duplicates in a list? - Question is more complex than it seems

So I have a huge list of entries in a DB (MySql)

I'm using Python and Django in the creation of my web application.

This is the base Django model I'm using:

class DJ(models.Model):
    alias = models.CharField(max_length=255)
    #other fields...

In my DB I have now duplicates

eg. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond, Disk Jokey Above and Beyond, ...

This is a problem... as it blows a big hole in my DB and therefore my application.


I'm sure other people have encountered this problem and thought about it.

My ideas are the following:

  • Create a set of rules so a new entry cannot be created?

    eg. "DJ Above and Beyond" cannot be created because "Above & Beyond" is in the DB

  • Relate these aliases to each other somehow?

    eg. relate "DJ Above and Beyond" to "Above & Beyond"


I have literally no clue how to go on about this, even if someone could point me into a direction that would be very helpful.

Any help would be very much appreciated! Thank you guys.

like image 723
RadiantHex Avatar asked Dec 13 '22 00:12

RadiantHex


2 Answers

I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically - without creating a fairly complex rules-based system.

Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.

Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.

like image 77
Dominic Rodger Avatar answered Jan 22 '23 14:01

Dominic Rodger


From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.

  • String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2" which is closer in Levenshtein distance.
  • Some cheap alternatives to natural language parsing are soundex, which will match by phonetic sounds, and Stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
  • If this is a User-interacting program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
  • You could normalize the entries in some way so that different entries map to the same normalized value (e.g. case normalize, "&" -> "And", etc, etc. which some of the above suggestions might be a step towards) to find near misses or map multiple inputs to a single value.

Add the caveat that my experience only applies to English, e.g. an english PorterStemmer won't recognize the one French title you put in there.

like image 34
Steve B. Avatar answered Jan 22 '23 14:01

Steve B.