Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the most suitable string distance algorithm to use for comparing TV show titles?

I'm writing a scraper for TV Shows and other pieces of media (games, movies, etc.), and not all sources are formatted the same way for a certain show. For example, one source might represent subtitles with a dash, others semicolons. I'm currently using Levenshtein distance to compare the scraped data with data extracted from the TV show filename, but I was wondering if the algorithm was designed for short strings less than a sentence long. Is there an algorithm that better suits this need?

like image 840
chyyran Avatar asked Oct 19 '22 06:10

chyyran


1 Answers

Before comparison / distance measurement, you should normalize (standardize) the titles.

Normalization should include things like:

  • Basic Formatting (e.g. UTF16 encoding, No leading/trailing spaces and tabs)
  • Alphabet rules (e.g. Replace Ä with A)
  • Acronym expansion (e.g. NY -> New-York)
  • Location names rules (e.g. City names should not contain spaces, but dashes)
  • Capitalization rules (e.g. Each letter following a dash should be capitalized)
  • Removal of symbols (e.g. !,?)
  • Number conversions ("three-hundred" to "300")
  • Roman numbers conversions (e.g. "Louis XVI" to "Louis 16")
  • Non-American English to American English (e.g. "colour" to "color")
  • Abbreviations rules (e.g. "Inc." instead of "Incorporated", "vs." instead of "versus")

You can use Levenshtein distance between pairs of words (Don't use it for the whole sentence), but implement some sliding window, since certain words (e.g. "The") may be missing from one of the representations.

like image 71
Lior Kogan Avatar answered Oct 21 '22 09:10

Lior Kogan