Converting key=value pairs back into Python dicts

Tags: python, string, dictionary, logging, parsing

There's a logfile with text in the form of space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:

' '.join([f'{k}={v!r}' for k,v in d.items()])

The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more no less.

How to process this logfile and turn the lines back into Python dicts? Example:

>>> to_dict("key='hello world'")
{'key': 'hello world'}

>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}

>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}

>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}

Here is some extra context about the data:

Keys are valid names
Input lines are well-formed (e.g. no dangling brackets)
The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
Order is not important. Performance is not important. Correctness is important.

Edit: As requested in the comments, here is an MCVE: an example attempt that didn't work correctly.

>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
... 
... 
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}  # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}  # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'}  # Incorrect, the value was corrupted
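
Since correctness is the stated priority, a round-trip check is one way to compare candidate parsers. The sketch below is only an illustration: `serialize`, `naive_to_dict` and `round_trips` are hypothetical names, and `naive_to_dict` simply reproduces the failing attempt above.

```python
def serialize(d):
    """Serialize a dict the way the log lines are described in the question."""
    return ' '.join(f'{k}={v!r}' for k, v in d.items())


def naive_to_dict(s):
    """The failing attempt from the question, kept for comparison (uses eval)."""
    s = s.replace(' ', ', ')
    return eval(f"dict({s})")


def round_trips(parser, d):
    """Return True if parser(serialize(d)) gives back exactly d."""
    try:
        return parser(serialize(d)) == d
    except Exception:
        # Any parse failure counts as a failed round trip.
        return False


if __name__ == "__main__":
    print(round_trips(naive_to_dict, {'k1': 'v1', 'k2': 'v2'}))
    print(round_trips(naive_to_dict, {'key': 'hello world'}))
```

Any of the answers' functions can be passed as `parser` in place of `naive_to_dict`.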

asked Oct 19 '18 19:10

wim

4 Answers

Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.

The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.

import ast
import io
import tokenize

def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of a binary
    # file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))

    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']

    names = [tokens[i-1][1] for i in eqsign_locations]

    # Values are harder than keys.
    val_starts = [i+1 for i in eqsign_locations]
    val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]

    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]

    return dict(zip(names, vals))

This behaves correctly on your example inputs, as well as on an example with backslashes:

>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
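
The last example works because an `=` inside quotes is reported as part of a STRING token rather than as an operator. A small sketch for inspecting the token stream of one line follows; `token_view` is a hypothetical helper name, not part of the answer's code.

```python
import io
import tokenize


def token_view(line):
    """Return (token type name, token text) pairs for one log line."""
    readline = io.BytesIO(line.encode('utf8')).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]


if __name__ == "__main__":
    for pair in token_view("a='=' c=3"):
        print(pair)
```

Only the two separators show up as `('OP', '=')`; the quoted one appears once, inside the `STRING` token.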

Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.
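
One more point that seems to favour `=` over NAME: the `tokenize` module reports keywords such as `True` and `None` as NAME tokens too, so a NAME-based split would need extra filtering for values like those. A quick check, with `name_tokens` as a hypothetical helper name:

```python
import io
import tokenize


def name_tokens(line):
    """Return the text of every NAME token that tokenize finds in line."""
    readline = io.BytesIO(line.encode('utf8')).readline
    return [tok.string for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.NAME]


if __name__ == "__main__":
    # Both the keys and the keyword values come back as NAME tokens.
    print(name_tokens("flag=True n=None"))
```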

answered Oct 04 '22 21:10

user2357112 supports Monica

Regex replacement functions to the rescue

I'm not rewriting an ast-like parser for you, but one trick that works pretty well is to use regular expressions to replace the quoted strings with "variables" (I've chosen __token(number)__), a bit like you're obfuscating some code.

Make a note of the strings you're replacing (that should take care of the spaces), replace spaces with commas (guarding against a preceding symbol like :, which is what lets the last test pass), then put the strings back.

import re,itertools

def to_dict(s):
    rep_dict = {}
    cnt = itertools.count()
    def rep_func(m):
        rval = "__token{}__".format(next(cnt))
        rep_dict[rval] = m.group(0)
        return rval

    # replaces single/double quoted strings by token variable-like idents
    # going out on a limb to support escaped quotes in the string and double escapes at the end of the string
    s = re.sub(r"(['\"]).*?([^\\]|\\\\)\1",rep_func,s)
    # replaces spaces that follow a letter/digit/underscore by comma
    # (raw strings avoid invalid-escape warnings for \w, \s and \d)
    s = re.sub(r"(\w)\s+",r"\1,",s)
    #print("debug",s)   # uncomment to see temp string
    # put back the original strings
    s = re.sub(r"__token\d+__",lambda m : rep_dict[m.group(0)],s)

    return eval("dict({s})".format(s=s))

print(to_dict("k1='v1' k2='v2'"))
print(to_dict("s='1234' n=1234"))
print(to_dict(r"key='hello world'"))
print(to_dict('key="hello world"'))
print(to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""))
# extreme string test
print(to_dict(r"key='hello \'world\\'"))

prints:

{'k2': 'v2', 'k1': 'v1'}
{'n': 1234, 's': '1234'}
{'key': 'hello world'}
{'key': 'hello world'}
{'k5': {'k6': ['potato']}, 'k4': 'k5="hello"'}
{'key': "hello 'world\\"}

The key is to extract the strings (single- or double-quoted) using a non-greedy regex and replace them with non-strings (as if they were string variables rather than literals) in the expression. The regex has been tuned so it can accept escaped quotes and a double escape at the end of the string (custom solution).
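
The string-matching step can be tried in isolation. The sketch below does not use the answer's regex: it swaps in an alternation of single- and double-quoted forms (an assumption made for illustration), because the answer's pattern needs at least one character between the quotes and so does not match an empty literal like '' on its own. `STRING_RE` and `find_strings` are hypothetical names.

```python
import re

# Assumed pattern: a quote, then any run of escaped characters or
# characters that are neither that quote nor a backslash, then the quote.
STRING_RE = re.compile(
    r"'(?:\\.|[^'\\])*'"     # single-quoted literal
    r'|"(?:\\.|[^"\\])*"'    # double-quoted literal
)


def find_strings(line):
    """Return every quoted literal found in line, in order."""
    return STRING_RE.findall(line)


if __name__ == "__main__":
    print(find_strings("a='' b='x=1' c=\"q r\""))
    print(find_strings(r"k='it\'s'"))
```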

The replacement function is an inner function so it can make use of the nonlocal dictionary & counter and track the replaced text, so it can be restored once the spaces have been taken care of.

When replacing the spaces with commas, you have to be careful not to do it after a colon (last test); all things considered, it is only done after an alphanumeric character or underscore (hence the \w guard in the comma-replacement regex).

If we uncomment the debug print just before the original strings are put back, it prints:

debug k1=__token0__,k2=__token1__
debug s=__token0__,n=1234
debug key=__token0__
debug k4=__token0__,k5={__token1__: [__token2__]}
debug key=__token0__

The strings have been pwned, and the replacement of spaces has worked properly. With some more effort, it should probably be possible to quote the keys and replace k1= with "k1": so that ast.literal_eval can be used instead of eval (more risky, and not required here).
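
A rough sketch of that ast.literal_eval variant, under stated assumptions: it keeps the placeholder technique but uses a different string pattern from the one above (assumed here so that empty strings are matched too), and `to_dict_literal` is a hypothetical name. It has only been reasoned through for inputs shaped like the question's examples.

```python
import ast
import itertools
import re

# Assumed string pattern (not the answer's regex): escaped characters or
# anything that is neither the closing quote nor a backslash.
STRING_RE = re.compile(
    r"'(?:\\.|[^'\\])*'"
    r'|"(?:\\.|[^"\\])*"'
)


def to_dict_literal(s):
    saved = {}
    counter = itertools.count()

    def protect(m):
        # Swap each quoted string for an identifier-like placeholder.
        name = "__token{}__".format(next(counter))
        saved[name] = m.group(0)
        return name

    s = STRING_RE.sub(protect, s)
    # Spaces that follow a word character separate key=value pairs.
    s = re.sub(r"(\w)\s+", r"\1,", s)
    # Turn key= into 'key': so the whole line becomes a dict display.
    s = re.sub(r"(\w+)=", r"'\1':", s)
    # Put the original strings back (a function avoids escape processing).
    s = re.sub(r"__token\d+__", lambda m: saved[m.group(0)], s)
    return ast.literal_eval("{" + s + "}")


if __name__ == "__main__":
    print(to_dict_literal("""k4='k5="hello"' k5={'k6': ['potato']}"""))
```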

I'm sure some super-complex expressions can break my code (I've even heard that very few JSON parsers are able to parse 100% of valid JSON files), but for the tests you submitted, it'll work. Of course, if some funny guy puts __tokenxx__ idents in the original strings, that'll fail; maybe they could be replaced by placeholders that are otherwise invalid as variable names. I built an Ada lexer using this technique some time ago to avoid tripping over spaces in strings, and that worked pretty well.

answered Oct 04 '22 22:10

Jean-François Fabre

You can find all the occurrences of = characters, and then find the longest runs of characters that give a valid ast.literal_eval result. Each run is parsed as the value and associated with a key, which is the string slice between the end of the last successful parse and the index of the current =:

import ast, typing
def is_valid(_str:str) -> bool:  
  try:
     _ = ast.literal_eval(_str)
  except:
    return False
  else:
    return True

def parse_line(_d:str) -> typing.Generator[typing.Tuple, None, None]:
  _eq, last = [i for i, a in enumerate(_d) if a == '='], 0
  for _loc in _eq:
     if _loc >= last:
       _key = _d[last:_loc]
       _inner, seen, _running, _worked = _loc+1, '', _loc+2, []
       while True:
         try:
            val = ast.literal_eval(_d[_inner:_running])
         except:
            _running += 1
         else:
            _max = max([i for i in range(len(_d[_inner:])) if is_valid(_d[_inner:_running+i])])
            yield (_key, ast.literal_eval(_d[_inner:_running+_max]))
            last = _running+_max
            break


def to_dict(_d:str) -> dict:
  return dict(parse_line(_d))

for result in [to_dict("key='hello world'"),
               to_dict("k1='v1' k2='v2'"),
               to_dict("s='1234' n=1234"),
               to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""),
               to_dict("val=['100', 100, 300]"),
               to_dict("val=[{'t':{32:45}, 'stuff':100, 'extra':[]}, 100, 300]")]:
    # print one result per line, matching the output below
    print(result)

Output:

{'key': 'hello world'}
{'k1': 'v1', 'k2': 'v2'}
{'s': '1234', 'n': 1234}
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
{'val': ['100', 100, 300]}
{'val': [{'t': {32: 45}, 'stuff': 100, 'extra': []}, 100, 300]}

Disclaimer:

This solution is not as elegant as @Jean-FrançoisFabre's, and I am not sure if it can parse 100% of what is passed to to_dict, but it may give you inspiration for your own version.
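
The same "longest valid run" idea can be written more compactly. This is a sketch rather than a drop-in replacement: `longest_literal_prefix` and `to_dict_prefix` are hypothetical names, it assumes the key is everything up to the next `=`, and it re-parses every prefix, which is only acceptable because performance does not matter here.

```python
import ast


def longest_literal_prefix(text):
    """Return (value, end) for the longest prefix of text that literal_eval accepts."""
    best = None
    for end in range(1, len(text) + 1):
        try:
            value = ast.literal_eval(text[:end])
        except (ValueError, SyntaxError, TypeError):
            continue  # not a complete literal at this length
        best = (value, end)
    if best is None:
        raise ValueError('no literal found at start of {!r}'.format(text))
    return best


def to_dict_prefix(line):
    result = {}
    rest = line.strip()
    while rest:
        # The key ends at the first '='; the value is the longest literal after it.
        key, rest = rest.split('=', 1)
        value, end = longest_literal_prefix(rest)
        result[key.strip()] = value
        rest = rest[end:].lstrip()
    return result


if __name__ == "__main__":
    print(to_dict_prefix("s='1234' n=1234"))
```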

answered Oct 04 '22 22:10

Ajax1234

Provide two helper functions.

popstr: split off the thing at the start of the string that looks like a string literal.
If it starts with a single or double quote mark, I'll look for the next one and split at that point.
```
def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]
```
poptrt: split off the thing at the start of the string that is surrounded by brackets ('[]', '()', '{}').
If it starts with a bracket, I'll start incrementing for every instance of the starting character and decrementing for every instance of its complement. When I reach zero, I split.

def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')

Chew through string until there is no more string to chew

def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d

All tests passed

assert to_dict("key='hello world'") == {'key': 'hello world'}
assert to_dict("k1='v1' k2='v2'") == {'k1': 'v1', 'k2': 'v2'}
assert to_dict("s='1234' n=1234") == {'s': '1234', 'n': 1234}
assert to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""") == {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}

Deficiencies

Did not account for backslashes
Did not account for nested goofy formatting
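
For the first deficiency, one possible direction is a variant of popstr that skips over backslash escapes while scanning for the closing quote. This is only a sketch; `popstr_escaped` is a hypothetical name and it is not wired into the code below.

```python
def popstr_escaped(s):
    """Split a leading quoted string off s, honoring backslash escapes."""
    quote = s[0]
    i = 1
    while i < len(s):
        if s[i] == '\\':
            i += 2  # skip the backslash and the character it escapes
        elif s[i] == quote:
            return s[:i + 1], s[i + 1:]
        else:
            i += 1
    raise ValueError('Unterminated string literal.')


if __name__ == "__main__":
    print(popstr_escaped(r"'hello \'world\\' k=1"))
```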

All Together

import ast

def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]

def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')

def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d

answered Oct 04 '22 22:10

piRSquared
