I'm writing a scraper for TV shows and other pieces of media (games, movies, etc.), and not all sources format a given show's title the same way. For example, one source might separate a subtitle with a dash, another with a semicolon. I'm currently using Levenshtein distance to compare the scraped data with data extracted from the TV show filename, but I was wondering whether the algorithm is really suited to short strings less than a sentence long. Is there an algorithm that better suits this need?
Before comparing titles or measuring the distance between them, you should normalize (standardize) them.
Normalization should include things like:
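As a rough illustration only, here is a minimal Python sketch of one possible normalization step. The specific rules (lowercasing, treating dashes, colons, semicolons and other punctuation as plain separators, collapsing whitespace) are assumptions based on the dash-versus-semicolon example in the question, not an exhaustive list, and `normalize_title` is a hypothetical helper name.

```python
import re


def normalize_title(title: str) -> str:
    """Reduce a title to lowercase words separated by single spaces.

    Assumed rules (illustrative, not exhaustive):
    - case is ignored
    - any run of punctuation, underscores or whitespace counts as one separator
    """
    lowered = title.lower()
    # [\W_]+ matches runs of anything that is not a letter or digit,
    # so "-", ":", ";", "." and "_" all collapse to a single space.
    spaced = re.sub(r"[\W_]+", " ", lowered)
    return spaced.strip()


if __name__ == "__main__":
    for raw in ("Star Trek - Discovery", "Star Trek: Discovery", "star_trek;discovery"):
        print(normalize_title(raw))
```

With these rules, titles that differ only in their subtitle separator should compare as equal before any distance is computed.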
You can use Levenshtein distance between pairs of words rather than across the whole title, but compare within a sliding window, since certain words (e.g. "The") may be missing from one of the representations.
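One way to read the word-level, sliding-window idea is sketched below in Python. This is an interpretation rather than a definitive implementation: the function names, the default `window` of 1, and the choice to sum the best per-word distances are all assumptions made for illustration. Each word of the shorter title is matched against the closest word within a few positions in the longer title, so a missing "The" only shifts the alignment instead of penalizing every word.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def windowed_word_distance(title_a: str, title_b: str, window: int = 1) -> int:
    """Sum, over the words of the shorter title, of the smallest edit
    distance to any word within `window` positions in the longer title.

    Assumption: extra words in the longer title (e.g. a leading "the")
    are simply ignored rather than penalized.
    """
    shorter = title_a.lower().split()
    longer = title_b.lower().split()
    if len(shorter) > len(longer):
        shorter, longer = longer, shorter

    total = 0
    for i, word in enumerate(shorter):
        lo = max(0, i - window)
        hi = min(len(longer), i + window + 1)
        # The slice is never empty because i < len(shorter) <= len(longer).
        total += min(levenshtein(word, other) for other in longer[lo:hi])
    return total


if __name__ == "__main__":
    print(windowed_word_distance("the walking dead", "walking dead"))
    print(windowed_word_distance("walking dead", "walkin dead"))
```

A larger `window` tolerates more missing or reordered words but also makes accidental matches between unrelated titles more likely, so it is worth tuning against real scraped data.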