What is the most suitable string distance algorithm to use for comparing TV show titles?

Question

I'm writing a scraper for TV Shows and other pieces of media (games, movies, etc.), and not all sources are formatted the same way for a certain show. For example, one source might represent subtitles with a dash, others semicolons. I'm currently using Levenshtein distance to compare the scraped data with data extracted from the TV show filename, but I was wondering if the algorithm was designed for short strings less than a sentence long. Is there an algorithm that better suits this need?

Lior Kogan · Accepted Answer

Before comparison / distance measurement, you should normalize (standardize) the titles.

Normalization should include things like:

Basic Formatting (e.g. UTF16 encoding, No leading/trailing spaces and tabs)
Alphabet rules (e.g. Replace Ä with A)
Acronym expansion (e.g. NY -> New-York)
Location names rules (e.g. City names should not contain spaces, but dashes)
Capitalization rules (e.g. Each letter following a dash should be capitalized)
Removal of symbols (e.g. !,?)
Number conversions ("three-hundred" to "300")
Roman numbers conversions (e.g. "Louis XVI" to "Louis 16")
Non-American English to American English (e.g. "colour" to "color")
Abbreviations rules (e.g. "Inc." instead of "Incorporated", "vs." instead of "versus")

You can use Levenshtein distance between pairs of words (Don't use it for the whole sentence), but implement some sliding window, since certain words (e.g. "The") may be missing from one of the representations.

What is the most suitable string distance algorithm to use for comparing TV show titles?

Tags:

string

algorithm

levenshtein-distance

chyyran

1 Answers

Lior Kogan

Recent Activity

Donate For Us

What is the most suitable string distance algorithm to use for comparing TV show titles?

Tags:

string

algorithm

levenshtein-distance

chyyran

1 Answers

Lior Kogan

Related questions

Recent Activity

Donate For Us