I have input that looks like a list of arguments:
input1 = '''
title="My First Blog" author='John Doe'
'''
The values can be surrounded by single or double quotes, and escaped quotes inside a value are also allowed:
input2 = '''
title='John\'s First Blog' author="John Doe"
'''
Is there a way to use regular expressions to extract the key value pairs accounting for either single or double quotes and escaped quotes?
Using Python, I can handle the non-escaped quotes with the following regular expression:
rex = r"(\w+)\=(?P<quote>['\"])(.*?)(?P=quote)"
The results are then:
import re
re.findall(rex, input1)
[('title', '"', 'My First Blog'), ('author', "'", 'John Doe')]
and
import re
re.findall(rex, input2)
[('title', "'", 'John'), ('author', '"', 'John Doe')]
The latter is incorrect. I can't figure out how to handle escaped quotes, presumably in the (.*?) section. I've been working with the solutions in the answers to "Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)", to no avail.
Technically, I don't need findall to return the quote character--rather just the key/value pairs--but that is easily dealt with.
Any help would be appreciated! Thanks!
EDIT
My initial regex solution had a bug in it. That bug masked an error in your input string: input2 is not what you think it is:
>>> input2 = '''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2 # See - the apostrophe is not correctly escaped!
'\ntitle=\'John\'s First Blog\' author="John Doe"\n'
You need to make input2 a raw string (or use double backslashes):
>>> input2 = r'''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2
'\ntitle=\'John\\\'s First Blog\' author="John Doe"\n'
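As a quick check of the point above, here is a minimal sketch comparing the three spellings of the input. The variable names (raw_input2, escaped_input2, plain_input2) are made up for illustration and are not part of the original question.

```python
# Raw string: the backslash before the apostrophe is kept as-is.
raw_input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

# Normal string with a doubled backslash: same content as the raw string.
escaped_input2 = '''
title='John\\'s First Blog' author="John Doe"
'''

# Normal string with a single backslash: Python consumes the backslash,
# so the regex never sees an escaped quote.
plain_input2 = '''
title='John\'s First Blog' author="John Doe"
'''

same = raw_input2 == escaped_input2
backslash_kept = "\\" in raw_input2
backslash_lost = "\\" not in plain_input2
print(same, backslash_kept, backslash_lost)
```

If all three values print as True, the raw and double-backslash forms are interchangeable, and the plain form has silently dropped the escape.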
Now you can use a regex that handles escaped quotes correctly:
>>> rex = re.compile(
...     r"""(\w+)   # Match an identifier (group 1)
...     =           # Match =
...     (['"])      # Match an opening quote (group 2)
...     (           # Match and capture into group 3:
...      (?:        # the following regex:
...       \\.       # Either an escaped character
...      |          # or
...       (?!\2)    # (as long as we're not right at the matching quote)
...       .         # any other character.
...      )*         # Repeat as needed
...     )           # End of capturing group
...     \2          # Match the corresponding closing quote.""",
...     re.DOTALL | re.VERBOSE)
>>> rex.findall(input2)
[('title', "'", "John\\'s First Blog"), ('author', '"', 'John Doe')]
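Since the question only needs the key/value pairs, the result can be turned into a dict and the escapes removed in one pass. The following is a rough sketch rather than a definitive implementation: ATTR_RE and parse_attrs are hypothetical names, the pattern is the same idea as above rewritten with named groups, and the unescaping step simply drops the backslash in front of any character (so it does not interpret sequences such as \n as a newline).

```python
import re

# Same approach as the regex above, using named groups (names are illustrative).
ATTR_RE = re.compile(
    r"""(?P<key>\w+)           # identifier
        =
        (?P<quote>['"])        # opening quote
        (?P<value>
          (?: \\.              # an escaped character
            | (?!(?P=quote)).  # or any character that is not the closing quote
          )*
        )
        (?P=quote)             # matching closing quote
    """,
    re.DOTALL | re.VERBOSE,
)


def parse_attrs(text):
    """Return a dict of key/value pairs with backslash escapes removed."""
    result = {}
    for m in ATTR_RE.finditer(text):
        # Assumption: an escape just means "take the next character literally".
        value = re.sub(r"\\(.)", r"\1", m.group("value"), flags=re.DOTALL)
        result[m.group("key")] = value
    return result


sample = r'''
title='John\'s First Blog' author="John Doe"
'''
print(parse_attrs(sample))
```

Using finditer with named groups avoids having to discard the quote character from each findall tuple afterwards.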