I'm trying to extract movie metadata (title and year) from file names.
The naming pattern is not standard, but it's not random either, so I'm trying to cover as many cases as I can.
To give you an idea, these are examples of file names:
samples = ['The Movie Title.avi',
'The Movie Title DVDRIP. Useless.info.avi',
'The Movie Title [2005].avi',
'The Movie Title (2005) [Useless.info].avi',
'The Movie Title 2005 H264 DVDRip Useless-Info.avi',
'The Movie Title 2005 XviD Useless info.avi',
'The Movie Title {2005} DVDRIP. UselessInfo.avi',
'The.Movie.Title.2005.Useless.info.avi',
'[Useless.info]_The.Movie.Title.2005.Useless.avi']
Wherever UselessInfo appears, it's because what is written there could be anything and can't be used to fetch information (it changes from file to file). Also note that 'The Movie Title' might contain numbers or non-alphabetic characters, for example: 'The Movie Title 2 - The Return'.
The expected output should be a dictionary like:
metadata = {'title': 'The Movie Title', 'year': '2005'}
Right now I'm using a chain of regexps, but I don't know if there's a better way of doing that.
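For illustration only, here is a minimal sketch of what such a chain of regular expressions could look like. It is not the asker's actual code: the function name `extract_metadata` and the `NOISE_TAGS` list are assumptions made up for this example, and the year is assumed to be the first standalone 19xx or 20xx number in the name.

```python
import os
import re

# Hypothetical list of release tags; extend it with whatever shows up in your files.
NOISE_TAGS = ('dvdrip', 'bdrip', 'brrip', 'xvid', 'h264', 'x264')

# A leading "[anything]" group plus the separators that follow it.
LEADING_TAG_RE = re.compile(r'^\[[^\]]*\][\s._-]*')
# A 4-digit year starting with 19 or 20, not glued to other digits.
YEAR_RE = re.compile(r'(?<!\d)((?:19|20)\d{2})(?!\d)')
# Any known tag that is not part of a longer alphanumeric word.
NOISE_RE = re.compile(
    r'(?<![A-Za-z0-9])(?:' + '|'.join(NOISE_TAGS) + r')(?![A-Za-z0-9])',
    re.IGNORECASE,
)


def extract_metadata(filename):
    """Return {'title': ..., 'year': ...}; year is None when none is found."""
    stem, _ext = os.path.splitext(filename)      # drop ".avi" and the like
    stem = LEADING_TAG_RE.sub('', stem)          # drop "[Useless.info]_"

    year = None
    match = YEAR_RE.search(stem)
    if match:
        year = match.group(1)
        stem = stem[:match.start()]              # the title ends where the year starts

    noise = NOISE_RE.search(stem)
    if noise:
        stem = stem[:noise.start()]              # cut at the first release tag

    title = re.sub(r'[._]+', ' ', stem)          # dots and underscores act as spaces
    title = re.sub(r'\s+', ' ', title).strip(' -[({')
    return {'title': title, 'year': year}


print(extract_metadata('The Movie Title (2005) [Useless.info].avi'))
# {'title': 'The Movie Title', 'year': '2005'}
```

A known weakness of this sketch: a title that itself contains a year-like number is cut at that number, so such names would need a special case.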
This question is old, but in case someone still needs it: I found a Python library named PTN very useful. Many thanks to the person who wrote it!
Install it with: pip install parse-torrent-name
import PTN
torrentName = "[Torrent9.info ] Silicon.Valley.S04E04.VOSTFR.WEB-DL.XviD-T9.avi"
info = PTN.parse(torrentName)
print(info)
Output: {'episode': 4, 'codec': 'XviD', 'title': 'Silicon.Valley.', 'group': 'T9', 'website': 'Torrent9.info', 'excess': 'VOSTFR', 'season': 4, 'quality': 'WEB-DL'}
So it seems to be exactly what you need!
As you've mentioned in one of the comments, the purpose of processing file names into a "standardized movie title form" is to compare two lists.
With your current approach you can miss a lot of corner cases.
First of all, you need to think carefully about what kinds of variation you accept. You've mentioned different positions for "movie" and "the"; what about misspellings and case sensitivity? What about word order?
Instead of making your code longer and longer, I'd recommend looking for a more universal solution.
A few ideas came to my mind. Take what you like, mix as you like, heat a little bit and it will be cooked nicely. Here we go:
A ^ B or A - B, again depending on what you need (check out the Python manual on sets).
Why not download a database (perhaps from Wikipedia) with a list of movie names and dates, and then compare the file names against this list? There are so many edge cases that it may be more efficient.
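As a rough sketch of how the set operations and the reference-list idea could fit together: normalize both lists so that case, punctuation and word order stop mattering, compare them as sets, then try a fuzzy match for whatever is left over. The `normalize` helper and both title lists below are made up for illustration, and the 0.8 cutoff is an arbitrary assumption you would need to tune.

```python
import difflib
import re


def normalize(title):
    """Lowercase, drop punctuation, sort the words (hypothetical helper)."""
    words = re.findall(r'[a-z0-9]+', title.lower())
    return ' '.join(sorted(words))


# Hypothetical data: titles from a reference database and titles from file names.
known = ['The Movie Title', 'Another Film']
found = ['Title, The Movie', 'the movie title', 'Anothr Film', 'Unknown Thing']

A = {normalize(t) for t in found}
B = {normalize(t) for t in known}

unmatched = A - B        # titles from files with no exact counterpart
differences = A ^ B      # titles present in only one of the two lists

# For the leftovers, look for a near match to catch misspellings.
suggestions = {
    t: difflib.get_close_matches(t, sorted(B), n=1, cutoff=0.8)
    for t in sorted(unmatched)
}
print(suggestions)
# {'anothr film': ['another film'], 'thing unknown': []}
```

Sorting the words is a blunt way to ignore word order; it would also make two different titles built from the same words look identical, so drop that step if it matters for your data.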