
How to extract movie title from file name

Tags: python, regex

I'm trying to extract movie metadata (title and year) from file names.

The naming pattern isn't standard, but it isn't random either, so I'm trying to cover as many cases as I can.
To give you an idea, here are some example file names:

samples = ['The Movie Title.avi',
           'The Movie Title DVDRIP. Useless.info.avi',
           'The Movie Title [2005].avi',
           'The Movie Title (2005) [Useless.info].avi',
           'The Movie Title 2005 H264 DVDRip Useless-Info.avi',
           'The Movie Title 2005 XviD Useless info.avi',
           'The Movie Title {2005} DVDRIP. UselessInfo.avi',
           'The.Movie.Title.2005.Useless.info.avi',
           '[Useless.info]_The.Movie.Title.2005.Useless.avi']

Wherever UselessInfo appears, the text at that position could be anything and can't be used to fetch information (it changes from file to file). Also note that 'The Movie Title' might contain numbers or non-alphabetic characters, for example: 'The Movie Title 2 - The Return'.

The expected output should be a dictionary like:

metadata = {'title': 'The Movie Title', 'year': '2005'}

Right now I'm using a chain of regexps, but I don't know if there's a better way to do this.
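
For illustration, a regexp chain of this kind might look like the sketch below. It is not the exact code in use: the function name parse_filename, the extension list, and the release-tag list are all assumptions made up for the example, and the lists are far from exhaustive.

```python
import re

# Assumed, non-exhaustive lists; extend them for a real collection.
VIDEO_EXT = re.compile(r"\.(avi|mkv|mp4|mov|wmv)$", re.IGNORECASE)
RELEASE_TAG = re.compile(r"\b(dvdrip|xvid|h264|x264|bluray|hdtv)\b", re.IGNORECASE)
# A four-digit year (19xx or 20xx), optionally wrapped in (), [] or {}.
YEAR = re.compile(r"[\[({]?\b((?:19|20)\d{2})\b[\])}]?")


def parse_filename(name):
    """Return {'title': ..., 'year': ...}; year is None when none is found."""
    stem = VIDEO_EXT.sub("", name)               # drop the file extension
    stem = re.sub(r"^\s*\[[^\]]*\]", "", stem)   # drop a leading [group] prefix
    stem = re.sub(r"[._]+", " ", stem).strip()   # treat dots/underscores as spaces

    year = None
    match = YEAR.search(stem, 1)  # start at index 1 so a title may begin with a year
    if match:
        year = match.group(1)
        stem = stem[:match.start()]              # the title is whatever precedes the year
    else:
        tag = RELEASE_TAG.search(stem)
        if tag:
            stem = stem[:tag.start()]            # no year: cut at the first release tag
    return {"title": stem.strip(" -"), "year": year}


print(parse_filename("The Movie Title (2005) [Useless.info].avi"))
# {'title': 'The Movie Title', 'year': '2005'}
```

The obvious weakness is that a title containing a year-like number after its first word gets cut too early, and every new release tag has to be added by hand.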

Rik Poggi asked Jan 18 '12 20:01


3 Answers

This was asked a long time ago, but in case someone still needs it: I found a Python library named PTN very useful. Many thanks to its author!

Install it with: pip install parse-torrent-name

import PTN

torrentName = "[Torrent9.info ] Silicon.Valley.S04E04.VOSTFR.WEB-DL.XviD-T9.avi"

info = PTN.parse(torrentName)

print(info)

Output : {'episode': 4, 'codec': 'XviD', 'title': 'Silicon.Valley.', 'group': 'T9', 'website': 'Torrent9.info', 'excess': 'VOSTFR', 'season': 4, 'quality': 'WEB-DL'}

So it seems to be exactly what you need!

Lbrth_BoC answered Nov 09 '22 13:11


As you mentioned in one of the comments, the purpose of this "file name processing" into a "standardized movie title form" is to compare two lists.

With your current approach you can miss a lot of corner cases.

First of all, you need to think carefully about which kinds of variation you accept. You mentioned that words such as "the" and "movie" can appear in different places; what about misspellings and letter case? What about word order?

Instead of making your code longer and longer, I'd recommend looking for a more universal solution.

A few ideas came to mind. Take what you like, mix as you like, heat it a little and it will cook nicely. Here we go:

  • LCS: longest common substring problem, longest common subsequence problem. Useful when:
    • the order of words is important.
    • you want something universal: just set how long the common substring/subsequence has to be, as a percentage of the input (max, min, average, or sum of the two file name lengths; your choice).
  • Matching not strings, but sets of words. That makes you robust to word order, repetition, and similar variations. Since you're writing Python, it's easy to build a set of sets of words, or a map keyed by sets of words (use frozenset for the inner sets, since a regular set is not hashable). Here are a few hints:
    • For each movie, instead of regexp-ing the whole string: (1) split the movie file name into words, (2) eliminate "the", "movie", etc., (3) reduce words to their most important part ("walking" - "ing" -> "walk", etc.), (4) put the remaining words into a set, (5) the resulting set is the set that represents the movie.
    • For each list: convert all movie file names into sets (as above), and put all of those sets into a set (now you have a set of sets of strings).
    • For lists A and B: just compute A ^ B (symmetric difference) or A - B (difference), whichever you need (check out Python Manual: Sets).
  • If you later need to turn the set representing a movie back into its file name, then while creating lists A and B also create maps MA and MB that map each "set of words" to its "file name".
  • Again LCS, but now imagine that your alphabet consists of words. If you're not familiar with formal language terminology: imagine that the movie name is written with special letters, where each letter is exactly one word. That gives you a sequence of words in which you can search for a subsequence of words. Applying LCS now yields the "longest common set of words preserving order" in the movie title.
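
The set-of-words idea and the word-level LCS above can be sketched roughly as follows. This is only one possible reading of those hints: the stop-word list, the helper names (words, word_set, lcs_words) and the sample file names are assumptions for the example, step (3) (stemming) is left out, and the extension is stripped with os.path.splitext, which assumes every file name has one.

```python
import os
import re

STOP_WORDS = {"the", "a", "an", "movie"}  # assumed list of words to ignore


def words(filename):
    """Lower-cased alphanumeric words of a file name, stop words removed."""
    stem = os.path.splitext(filename)[0]
    return [w for w in re.findall(r"[a-z0-9]+", stem.lower())
            if w not in STOP_WORDS]


def word_set(filename):
    """Hashable set of words representing one movie."""
    return frozenset(words(filename))


def lcs_words(a, b):
    """Longest common subsequence of two word lists (classic DP)."""
    # best[i][j] holds the LCS of a[:i] and b[:j]
    best = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            if wa == wb:
                best[i][j] = best[i - 1][j - 1] + [wa]
            else:
                best[i][j] = max(best[i - 1][j], best[i][j - 1], key=len)
    return best[-1][-1]


list_a = ["The Movie Title (2005).avi", "Another Film.avi"]
list_b = ["Movie.Title.2005.avi", "Something Else.mkv"]

# Maps MA and MB: "set of words" -> original file name
map_a = {word_set(f): f for f in list_a}
map_b = {word_set(f): f for f in list_b}

only_in_a = [map_a[k] for k in map_a.keys() - map_b.keys()]   # A - B
in_both = [map_a[k] for k in map_a.keys() & map_b.keys()]

print(only_in_a)  # ['Another Film.avi']
print(in_both)    # ['The Movie Title (2005).avi']
```

Exact set equality is strict: a single extra word such as "DVDRip" makes two sets differ, which is where a length threshold on lcs_words (or on the set intersection) would come in.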

Grzegorz Wierzowiecki answered Nov 09 '22 12:11


Why not download a database (perhaps from Wikipedia) with a list of movie names and dates, and then compare the file names against that list? There are so many edge cases that this may be more efficient.
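
A minimal sketch of that comparison, using difflib from the standard library. The KNOWN_MOVIES dictionary is a hypothetical stand-in for the downloaded database, and the function name lookup and the 0.8 cutoff are assumptions picked for the example.

```python
import difflib
import re

# Hypothetical stand-in for a downloaded title database: title -> year.
KNOWN_MOVIES = {"The Movie Title": "2005", "Another Film": "1999"}


def lookup(filename, known=KNOWN_MOVIES, cutoff=0.8):
    """Return {'title', 'year'} for the best-matching known title, or None."""
    words = re.findall(r"[a-z0-9]+", filename.lower())
    best_title, best_score = None, cutoff
    # Try every contiguous run of words, since noise may surround the title.
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            window = " ".join(words[start:end])
            for title in known:
                score = difflib.SequenceMatcher(None, window, title.lower()).ratio()
                if score >= best_score:
                    best_title, best_score = title, score
    if best_title is None:
        return None
    return {"title": best_title, "year": known[best_title]}


print(lookup("[Useless.info]_The.Movie.Title.2005.Useless.avi"))
# {'title': 'The Movie Title', 'year': '2005'}
```

Scoring every word window against every title is quadratic in the file name length and linear in the database size, so a real database would need an index or a pre-filter; the fuzzy ratio does tolerate small misspellings in the file name, though.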

charlax answered Nov 09 '22 13:11