
Parsing single or double quotes and allow for escaped characters using regular expressions (in Python)

I have input that looks like a list of arguments:

input1 = '''
title="My First Blog" author='John Doe'
'''

The values can be surrounded by single or double quotes, and escaped quotes are also allowed inside a value:

input2 = '''
title='John\'s First Blog' author="John Doe"
'''

Is there a way to use regular expressions to extract the key value pairs accounting for either single or double quotes and escaped quotes?

Using Python, I can handle the non-escaped quotes with the following regular expression:

rex = r"(\w+)\=(?P<quote>['\"])(.*?)(?P=quote)"

The returns are then:

import re
re.findall(rex, input1)
[('title', '"', 'My First Blog'), ('author', "'", 'John Doe')]

and

import re
re.findall(rex, input2)
[('title', "'", 'John'), ('author', '"', 'John Doe')]

The latter is incorrect. I can't figure out how to handle escaped quotes, presumably in the (.*?) section. I've been working from the answers posted to "Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)", to no avail.

Technically, I don't need findall to return the quote character--rather just the key/value pairs--but that is easily dealt with.

Any help would be appreciated! Thanks!

Jeff asked Nov 05 '12 20:11


1 Answer

EDIT

My initial regex solution had a bug in it. That bug masked an error in your input string: input2 is not what you think it is:

>>> input2 = '''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2      # See - the apostrophe is not correctly escaped!
'\ntitle=\'John\'s First Blog\' author="John Doe"\n'  

You need to make input2 a raw string (or use double backslashes):

>>> input2 = r'''
... title='John\'s First Blog' author="John Doe"
... '''
>>> input2
'\ntitle=\'John\\\'s First Blog\' author="John Doe"\n'
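To see the difference outside the interactive prompt, here is a small sketch that builds both versions of the string and checks whether a backslash actually survives. The names plain and raw are just illustrative:

```python
# Non-raw literal: Python consumes the backslash, so \' becomes a bare '.
plain = '''
title='John\'s First Blog' author="John Doe"
'''

# Raw literal: the backslash is kept as a real character in the string.
raw = r'''
title='John\'s First Blog' author="John Doe"
'''

print("\\" in plain)          # is there a backslash in the non-raw string?
print("\\" in raw)            # is there a backslash in the raw string?
print(len(raw) - len(plain))  # the raw string is longer by the kept backslash
```

If the input comes from a file or from user input rather than from a string literal in your source code, the backslash is already a real character and none of this applies.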

Now you can use a regex that handles escaped quotes correctly:

>>> import re
>>> rex = re.compile(
...     r"""(\w+)    # Match an identifier (group 1)
...     =        # Match =
...     (['"])   # Match an opening quote (group 2)
...     (        # Match and capture into group 3:
...      (?:     # the following regex:
...       \\.    # Either an escaped character
...      |       # or
...       (?!\2) # (as long as we're not right at the matching quote)
...       .      # any other character.
...      )*      # Repeat as needed
...     )        # End of capturing group
...     \2       # Match the corresponding closing quote.""",
...     re.DOTALL | re.VERBOSE)
>>> rex.findall(input2)
[('title', "'", "John\\'s First Blog"), ('author', '"', 'John Doe')]
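Since the question only needs the key/value pairs, one possible follow-up is to drop the quote group and strip the escaping backslashes from the values. The sketch below uses the same pattern in compact form. The helper name parse_pairs is made up for illustration, and the unescaping rule (a backslash followed by any character becomes just that character) is an assumption about what the input format intends, not something the question specifies:

```python
import re

# Same pattern as above, written without the verbose comments.
REX = re.compile(r"""(\w+)=(['"])((?:\\.|(?!\2).)*)\2""", re.DOTALL)

def parse_pairs(text):
    """Return a dict of key -> value with escaping backslashes removed."""
    pairs = {}
    for m in REX.finditer(text):
        key, value = m.group(1), m.group(3)
        # Assumption: "\x" simply means a literal "x" (so \' -> ' and \\ -> \).
        pairs[key] = re.sub(r"\\(.)", r"\1", value, flags=re.DOTALL)
    return pairs

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''
result = parse_pairs(input2)
print(result)  # {'title': "John's First Blog", 'author': 'John Doe'}
```

If a key appears more than once, later values overwrite earlier ones in this sketch; use a list of tuples instead of a dict if duplicates matter.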
Tim Pietzcker answered Sep 18 '22 14:09