I have a regular expression that can have either from:
(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")
These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both expressions when using re.search or re.findall (as there may be multiple expression in the string).
It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?
I am thinking that I need to somehow specify OR groups---for instance, consider:
The pattern \((.*)( ".*")\) matches the second instance but not the first because it does not contain "...".
r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)'
r.groups() # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'
While \((.*)( ".*")?\) matches the first group but does not individually identify the "optional string" as a group in the second instance.
r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)
Any thoughts, ye' masters of expressions (of the regular variety)?
The simplest way is to make the first * non-greedy:
>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]
Since " aren't usually allowed to appear in file names, you can simply exclude them from the first group:
r = re.search(r'\(([^"]*)( ".*")?\)', input)
This is generally the preferred alternative to ungreedy repetition, because tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With