Regular expression to match comma separated list of key=value where value can contain commas

Tags: python, regex, parsing, python-2.6

I have a naive "parser" that simply does something like:
[x.split('=') for x in mystring.split(',')]

However mystring can be something like
'foo=bar,breakfast=spam,eggs'

Obviously, the naive splitter will not handle this. I am limited to the Python 2.6 standard library, so pyparsing, for example, cannot be used.

Expected output is
[('foo', 'bar'), ('breakfast', 'spam,eggs')]

I'm trying to do this with regex, but am facing the following problems:

My first attempt,
r'([a-z_]+)=(.+),?'
gave me
[('foo', 'bar,breakfast=spam,eggs')]

Making .+ non-greedy does not solve the problem either.

So I'm guessing I have to somehow make the trailing comma (or $) mandatory.
Doing just that does not really work either:
r'([a-z_]+)=(.+?)(?:,|$)'
With that pattern, everything after the comma in a value that contains one is omitted,
e.g. [('foo', 'bar'), ('breakfast', 'spam')]
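The three attempts described above can be reproduced with a short script. This is only a sketch for checking the behaviour; the variable names are illustrative, and the second pattern is my reading of "making .+ non-greedy" applied to the first attempt.

```python
import re

s = 'foo=bar,breakfast=spam,eggs'

# Attempt 1: the greedy .+ runs to the end of the string.
greedy = re.findall(r'([a-z_]+)=(.+),?', s)
# -> [('foo', 'bar,breakfast=spam,eggs')]

# Non-greedy .+? followed by an optional comma stops after one character.
lazy = re.findall(r'([a-z_]+)=(.+?),?', s)
# -> [('foo', 'b'), ('breakfast', 's')]

# Mandatory terminator: the value is cut at its first comma.
anchored = re.findall(r'([a-z_]+)=(.+?)(?:,|$)', s)
# -> [('foo', 'bar'), ('breakfast', 'spam')]
```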

I think I must use some sort of look-behind(?) operation.
The Question(s)
1. Which one do I use? or
2. How do I do that/this?

Edit:

Based on daramarak's answer below,
I ended up doing pretty much the same thing as abarnert later suggested, in a slightly more verbose form:

vals = [x.rsplit(',', 1) for x in (data.split('='))]
ret = list()
while vals:
    value = vals.pop()[0]
    key = vals[-1].pop()
    ret.append((key, value))
    if len(vals[-1]) == 0:
        break
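One thing to watch with the loop above: rsplit is also applied to the final chunk, so for 'foo=bar,breakfast=spam,eggs' the last value comes back as 'spam', and the pairs are collected back to front. A possible variation, sketched here under the assumption that the text after the last = is always a value (the function name parse_pairs_loop is hypothetical), keeps the final chunk whole and restores the order:

```python
def parse_pairs_loop(data):
    """Sketch: split 'k=v,k=v' text into (key, value) tuples, back to front."""
    chunks = data.split('=')
    value = chunks.pop()                  # the final chunk is a value only
    vals = [x.rsplit(',', 1) for x in chunks]
    ret = []
    while vals:
        prev = vals.pop()
        key = prev.pop()                  # text after the last comma is the key
        ret.append((key, value))
        if not prev:                      # the first chunk held only a key
            break
        value = prev[0]                   # the remainder is the previous value
    ret.reverse()                         # pairs were collected in reverse
    return ret
```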

EDIT 2:

Just to satisfy my curiosity, is this actually possible with pure regular expressions, i.e. so that re.findall() would return a list of 2-tuples?


asked Feb 01 '13 07:02

Kimvais

2 Answers

Just for comparison purposes, here's a regex that seems to solve the problem as well:

([^=]+)    # key
=          # equals is how we tokenise the original string
([^=]+)    # value
(?:,|$)    # value terminator, either comma or end of string

The trick here is to restrict what you're capturing in your second group. .+ swallows the = sign, which is the character we can use to distinguish keys from values. The full regex uses no lookaround or backreferences, so it does not require a backtracking engine (it should be compatible with something like re2, if that's desirable), and it works on abarnert's examples.

Usage as follows:

re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')

Which returns:

[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
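
As a sketch, the commented form of the pattern shown earlier can be compiled unchanged with re.VERBOSE, which ignores the whitespace and the # comments. The names PAIR_RE and parse_pairs_regex are illustrative and not part of the original answer. Note that a value containing a literal = will not parse as intended, since neither group may match that character.

```python
import re

# Same pattern as above, kept in its commented layout.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # separator between key and value
    ([^=]+)    # value: may contain commas, but never '='
    (?:,|$)    # value terminator: a comma or the end of the string
""", re.VERBOSE)


def parse_pairs_regex(text):
    """Return a list of (key, value) tuples found in text."""
    return PAIR_RE.findall(text)


pairs = parse_pairs_regex('foo=bar,breakfast=spam,eggs')
# -> [('foo', 'bar'), ('breakfast', 'spam,eggs')]
as_dict = dict(pairs)  # convenient when keys are unique
```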


answered Sep 28 '22 02:09

ig0774

daramarak's answer either very nearly works, or works as-is; it's hard to tell from the way the sample output is formatted and the vague descriptions of the steps. But if it's the very-nearly-works version, it's easy to fix.

Putting it into code:

>>> bits=[x.rsplit(',', 1) for x in s.split('=')]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]

The first line is (I believe) daramarak's answer. By itself, the first line gives you pairs of (value_i, key_i+1) instead of (key_i, value_i). The second line is the most obvious fix for that. With more intermediate steps, and a bit of output, to see how it works:

>>> s = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'
>>> bits0 = s.split('=')
>>> bits0
['foo', 'bar,breakfast', 'spam,eggs,blt', 'bacon,lettuce,tomato,spam', 'spam']
>>> bits = [x.rsplit(',', 1) for x in bits0]
>>> bits
[['foo'], ['bar', 'breakfast'], ['spam,eggs', 'blt'], ['bacon,lettuce,tomato', 'spam'], ['spam']]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]
>>> kv
[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
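
One edge case worth noting: rsplit is applied to the final chunk too, so when the last value contains a comma (as in the question's 'foo=bar,breakfast=spam,eggs') everything after its last comma is dropped. A minimal sketch that leaves the final chunk unsplit, assuming the text after the last = is always a value (the function name parse_pairs_split is hypothetical):

```python
def parse_pairs_split(s):
    """Return (key, value) tuples from 'k=v,k=v' text; values may contain commas."""
    chunks = s.split('=')
    # Every chunk except the last ends with the next key, so peel that key off.
    bits = [x.rsplit(',', 1) for x in chunks[:-1]]
    # The last chunk is a value only, so keep it whole.
    bits.append([chunks[-1]])
    # Pair the key at the end of chunk i with the value at the start of chunk i+1.
    return [(bits[i][-1], bits[i + 1][0]) for i in range(len(bits) - 1)]
```

With this change both the sample string above and the question's shorter string produce the expected pairs.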

answered Sep 28 '22 01:09

abarnert
