
python re, find expression containing an optional group

Tags: python, regex

I have an expression that can take either of these forms:

(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")

These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both forms when using re.search or re.findall (there may be multiple expressions in the string).

It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?

I am thinking that I need to somehow specify OR groups. For instance, consider:

The pattern \((.*)( ".*")\) matches the second form but not the first, because the first does not contain a "..." part.

r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)')
r.groups()  # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'

Meanwhile, \((.*)( ".*")?\) matches the first form, but in the second form it does not capture the "optional string" as a separate group: the greedy (.*) swallows it.

r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)

Any thoughts, ye' masters of expressions (of the regular variety)?

BFTM asked Jun 03 '26 21:06


2 Answers

The simplest way is to make the first * non-greedy, so that the first group stops as soon as the rest of the pattern can match:

>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]
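
One detail worth noting: re.findall reports an unmatched optional group as an empty string, whereas the question asks for None. A minimal sketch, reusing the same pattern with re.search (the variable names here are only illustrative), shows the difference:

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r'\((.*?)( ".*")?\)')

with_opt = '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
without_opt = '(src://path/to/foldernames canhave spaces/file.xzy)'

# A match object's groups() uses None for a group that did not participate.
groups_with = PATTERN.search(with_opt).groups()
groups_without = PATTERN.search(without_opt).groups()

# findall() substitutes '' for the same unmatched group.
found_without = PATTERN.findall(without_opt)

print(groups_with)
print(groups_without)
print(found_without)
```

If None is required when scanning a longer string, re.finditer yields match objects, so calling groups() on each one should behave like the re.search case.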
agf answered Jun 06 '26 10:06


Since " characters don't usually appear in file names, you can simply exclude them from the first group:

r = re.search(r'\(([^"]*)( ".*")?\)', input)

This is generally the preferred alternative to ungreedy repetition, because it tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.
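
Because the question mentions several expressions inside one longer string, here is a hedged sketch of how the exclusion idea might be extended to that case. It rests on assumptions not stated in the question: that paths contain neither double quotes nor parentheses, and that the optional string itself contains no double quotes. The sample text and the second path are made up for illustration.

```python
import re

# Assumptions (not from the question): paths contain no '"', '(' or ')',
# and the optional string contains no '"'.
# The non-capturing (?: ...) wrapper keeps the leading space and the quotes
# out of the second group, so it holds only the string's contents.
PATTERN = re.compile(r'\(([^"()]*)(?: "([^"]*)")?\)')

# Hypothetical longer string holding both forms.
text = (
    'first (src://path/to/foldernames canhave spaces/file.xzy) then '
    '(src://other/file.xzy "optional string") end'
)

# finditer yields match objects, so a missing optional group is None.
results = [m.groups() for m in PATTERN.finditer(text)]
print(results)
```

Excluding the parentheses as well keeps each match from running past its own closing ")" into the next expression, which is what a bare .* or [^"]* can do once more than one expression is present.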

Martin Ender answered Jun 06 '26 10:06