python regex optional capture group

Tags: python, regex

I have the following problem matching the needed data from filenames like this:

miniseries.season 1.part 5.720p.avi
miniseries.part 5.720p.avi
miniseries.part VII.720p.avi     # episode or season expressed in Roman numerals

The "season XX" chunk may or may not be present, or it may be written in a short form such as "s 1" or "seas 1".

In any case, I would like to have 4 capture groups giving this output:

group1 : miniseries
group2 : 1 (or None)
group3 : 5
group4 : 720p.avi

So I've written a regex like this:

(^.*)\Ws[eason ]*(\d{1,2}|[ivxlcdm]{1,5})\Wp[art ]*(\d{1,2}|[ivxlcdm]{1,5})\W(.*$)

This only works when I have a fully specified filename, including the optional "season XX" string. Is it possible to write a regex that returns None as group2 if "season" is not found?

user2181741 asked Mar 18 '13 10:03


1 Answer

It is easy enough to make the season group optional:

(^.*?)(?:\Ws(?:eason )?(\d{1,2}|[ivxlcdm]{1,5}))?\Wp(?:art )?(\d{1,2}|[ivxlcdm]{1,5})\W(.*$) 

using a non-capturing group ((?:...)) plus the 0 or 1 quantifier (?). I did have to make the first group non-greedy (.*? instead of .*); otherwise the first group would swallow the season section of the name and the optional group would match nothing.
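To see why the non-greedy first group matters, here is a small sketch that compares the two variants on the fully specified filename. The helper names NUM, TAIL, greedy and lazy are made up for this illustration; the pattern itself is the one from the answer, only split into pieces.

```python
import re

# Shared pieces of the answer's pattern (names are illustrative only).
NUM = r'(\d{1,2}|[ivxlcdm]{1,5})'
TAIL = r'(?:\Ws(?:eason )?' + NUM + r')?\Wp(?:art )?' + NUM + r'\W(.*$)'

greedy = re.compile(r'(^.*)' + TAIL, re.I)   # first group is greedy
lazy = re.compile(r'(^.*?)' + TAIL, re.I)    # first group is non-greedy

name = 'miniseries.season 1.part 5.720p.avi'
greedy_groups = greedy.search(name).groups()
lazy_groups = lazy.search(name).groups()

print(greedy_groups)  # ('miniseries.season 1', None, '5', '720p.avi')
print(lazy_groups)    # ('miniseries', '1', '5', '720p.avi')
```

With the greedy version the match still succeeds, but the season text ends up inside group 1 and group 2 is None, which is exactly the case the non-greedy quantifier avoids.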

I also made the eason and art optional strings into non-capturing optional groups instead of character classes.
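The difference between the character class from the question and the optional group can be checked in isolation. This is only a sketch with made-up sample strings: s[eason ]* accepts any run of those letters in any order, while s(?:eason )? accepts exactly "s" or "season ".

```python
import re

char_class = re.compile(r's[eason ]*', re.I)       # form used in the question
optional_group = re.compile(r's(?:eason )?', re.I)  # form used in the answer

# Sample strings chosen for illustration only.
samples = ['s', 'season ', 'seas ', 'snoo']

# Map each sample to (matched by character class, matched by optional group).
results = {
    text: (bool(char_class.fullmatch(text)), bool(optional_group.fullmatch(text)))
    for text in samples
}

for text, (by_class, by_group) in results.items():
    print(repr(text), by_class, by_group)
```

The character class matches all four samples, including the nonsense string 'snoo'. The optional group is stricter, so it rejects 'snoo', but it also rejects the short form 'seas '.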

Result:

>>> import re
>>> p=re.compile(r'(^.*?)(?:\Ws(?:eason )?(\d{1,2}|[ivxlcdm]{1,5}))?\Wp(?:art )?(\d{1,2}|[ivxlcdm]{1,5})\W(.*$)', re.I)
>>> p.search('miniseries.season 1.part 5.720p.avi').groups()
('miniseries', '1', '5', '720p.avi')
>>> p.search('miniseries.part 5.720p.avi').groups()
('miniseries', None, '5', '720p.avi')
>>> p.search('miniseries.part VII.720p.avi').groups()
('miniseries', None, 'VII', '720p.avi')
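The question also mentions short forms such as "s 1" and "seas 1", which the stricter (?:eason )? group does not accept. One possible variation, offered as an assumption rather than as part of the answer, is to spell out the allowed abbreviations and make the separating whitespace optional. The name relaxed and the choice of accepted spellings ("s", "seas", "season", "p", "part") are hypothetical.

```python
import re

# Hypothetical relaxed variant: accepts "s", "seas" or "season" and
# "p" or "part", each followed by optional whitespace.
relaxed = re.compile(
    r'(^.*?)'
    r'(?:\Ws(?:eas(?:on)?)?\s*(\d{1,2}|[ivxlcdm]{1,5}))?'
    r'\Wp(?:art)?\s*(\d{1,2}|[ivxlcdm]{1,5})'
    r'\W(.*$)',
    re.I,
)

names = [
    'miniseries.season 1.part 5.720p.avi',
    'miniseries.seas 1.part 5.720p.avi',
    'miniseries.s 1.part 5.720p.avi',
    'miniseries.part VII.720p.avi',
]

parsed = [relaxed.search(name).groups() for name in names]
for name, groups in zip(names, parsed):
    print(name, '->', groups)
```

The first three names all yield ('miniseries', '1', '5', '720p.avi'), and the last one yields ('miniseries', None, 'VII', '720p.avi'). Other abbreviations would need their own alternatives in the pattern.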
Martijn Pieters answered Sep 20 '22 10:09