Finding duplicate books

I have several lists of book titles along with their authors (no ISBNs). I want to merge them into a single list that contains each book exactly once, with the duplicate entries removed.

The problem I am facing is that different lists may follow different conventions for storing a book's entry. For example, one list might store the author's name in "last name, first name" order, while in another list the title field itself contains additional information, such as the name of the series and the book's sequence number.

Is there a standard algorithm for this type of problem? I don't want to reinvent the wheel. Right now I am coding the solution in PHP. As a start, I have tried levenshtein, soundex, metaphone, and similar_text, but none of them looks promising to me.

Example: Consider the Inheritance Cycle, a series of four books. The entry for the second book in the series might appear as Eldest, Eldest: The Inheritance Cycle (Book 2), Eldest (Inheritance), Eldest (Inheritance Cycle), or Inheritance 002: Eldest.

Coddy Martin asked Mar 13 '26 13:03



1 Answer

This sounds like a search problem, just with a more constrained domain. I would use an existing search technology (such as Lucene or Solr) and iterate through the list: search for a match first, and if a sufficiently close one isn't found, add the "document" (the info you have for one book) to the index.
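To make the search-then-add loop concrete, here is a minimal sketch in Python (for illustration only; the same steps port to PHP). It does not use Lucene or Solr: a plain in-memory list stands in for the index, and Jaccard overlap of title tokens stands in for the engine's relevance score. The stopword list, the 0.5 threshold, and the function names are all assumptions chosen for the Eldest example.

```python
import re

# Assumed list of words that carry little identifying information in a title.
STOPWORDS = {"the", "a", "an", "of", "book", "cycle", "series"}

def tokens(text):
    """Lowercase, split on non-alphanumerics, drop stopwords and pure numbers."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(w for w in words if w not in STOPWORDS and not w.isdigit())

def score(a, b):
    """Jaccard overlap of two token sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(entries, threshold=0.5):
    """Search the index for each entry; add it only if no close match exists."""
    index = []  # toy stand-in for a search index: (token_set, original_entry)
    for entry in entries:
        key = tokens(entry)
        best = max((score(key, k) for k, _ in index), default=0.0)
        if best < threshold:
            index.append((key, entry))
    return [entry for _, entry in index]

books = [
    "Eldest",
    "Eldest: The Inheritance Cycle (Book 2)",
    "Eldest (Inheritance)",
    "Eragon",
    "Inheritance 002: Eldest",
]
unique = dedupe(books)
```

A real search engine would replace `tokens` and `score` with its own analyzers and ranking, and you would probably index the author as a separate field. Note that a toy scorer like this can still confuse a series name with a title (the fourth book is itself called Inheritance), which is one reason to lean on a proper engine.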

It won't be a perfect answer, but it will give you a score for each candidate match, which gives you some tuneable parameters to work with. This is an especially enticing solution if this is more than a one-off problem, since the "algorithm" can learn and tune itself as it goes if needed.
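As a rough illustration of what "a score plus a tuneable parameter" looks like, the Python sketch below ranks candidates by a similarity score and applies a cutoff. It uses `difflib.SequenceMatcher` from the standard library purely as a stand-in for a search engine's relevance score; the function names and cutoff values are assumptions, not anything Lucene or Solr prescribes.

```python
from difflib import SequenceMatcher

def ranked_matches(query, candidates):
    """Return (score, candidate) pairs, best first, like a search result list."""
    q = query.lower()
    scored = [(SequenceMatcher(None, q, c.lower()).ratio(), c) for c in candidates]
    return sorted(scored, reverse=True)

def is_duplicate(query, candidates, cutoff):
    """True if the best-scoring candidate reaches the (tuneable) cutoff."""
    ranked = ranked_matches(query, candidates)
    return bool(ranked) and ranked[0][0] >= cutoff

known = ["Eldest", "Eragon", "Brisingr"]
results = ranked_matches("Eldest (Inheritance)", known)

# The same query is a duplicate or a new book depending on the cutoff.
loose = is_duplicate("Eldest (Inheritance)", known, cutoff=0.4)
strict = is_duplicate("Eldest (Inheritance)", known, cutoff=0.6)
```

Inspecting the ranked scores on a sample of your own data is a reasonable way to choose the cutoff: too low and distinct books get merged, too high and variants of the same book slip through as new entries.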

cdeszaq answered Mar 15 '26 01:03
