There's a logfile with text in the form of space-separated key=value
pairs, and each line was originally serialized from data in a Python dict, something like:
' '.join([f'{k}={v!r}' for k,v in d.items()])
The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more, no less.
How can I process this logfile and turn the lines back into Python dicts? Example:
>>> to_dict("key='hello world'")
{'key': 'hello world'}
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
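For reference, the last example round-trips: applying the serialization expression from the top of the question to the expected output reproduces the input line (on Python 3.7+, where dict insertion order is preserved):
>>> d = {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> ' '.join([f'{k}={v!r}' for k, v in d.items()])
'k4=\'k5="hello"\' k5={\'k6\': [\'potato\']}'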
Here is some extra context about the data: it is trusted, so unsafe functions (such as eval, exec, yaml.load) are OK to use.
Edit: As requested in the comments, here is an MCVE and an example of code that didn't work correctly:
>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
...
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'} # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234} # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'} # Incorrect, the value was corrupted
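The corruption is easy to see in the intermediate string: the naive replace also rewrites the space inside the quoted value before eval ever runs:
>>> "key='hello world'".replace(' ', ', ')
"key='hello, world'"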
Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.
The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.
import ast
import io
import tokenize

def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of
    # a binary file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))

    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']

    names = [tokens[i - 1][1] for i in eqsign_locations]

    # Values are harder than keys.
    val_starts = [i + 1 for i in eqsign_locations]
    val_ends = [i - 1 for i in eqsign_locations[1:]] + [len(tokens)]

    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]

    return dict(zip(names, vals))
This behaves correctly on your example inputs, as well as on an example with backslashes:
>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
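As a side demo (not part of the solution, and the exact token details can vary slightly between Python versions), here is the token stream for a line whose value contains an =. The = inside the string literal stays inside a single STRING token, so only the two OP tokens whose string is '=' act as separators:
>>> import io, tokenize
>>> for tok in tokenize.tokenize(io.BytesIO(b"a='=' b=2").readline):
...     print(tokenize.tok_name[tok.type], repr(tok.string))
...
ENCODING 'utf-8'
NAME 'a'
OP '='
STRING "'='"
NAME 'b'
OP '='
NUMBER '2'
NEWLINE ''
ENDMARKER ''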
Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.
Regex replacement functions to the rescue
I'm not rewriting an ast-like parser for you, but one trick that works pretty well is to use regular expressions to find the quoted strings and replace them by "variables" (I've chosen __token(number)__), a bit like you're obfuscating some code.
Make a note of the strings you're replacing (that takes care of the spaces inside them), replace the remaining spaces by commas (only doing so after a word character, which protects symbols like : and allows the last test to pass), and then substitute the original strings back in.
import re
import itertools

def to_dict(s):
    rep_dict = {}
    cnt = itertools.count()
    def rep_func(m):
        rval = "__token{}__".format(next(cnt))
        rep_dict[rval] = m.group(0)
        return rval
    # replace single/double quoted strings by token variable-like idents,
    # going out on a limb to support escaped quotes in the string and double
    # escapes at the end of the string
    s = re.sub(r"(['\"]).*?([^\\]|\\\\)\1", rep_func, s)
    # replace spaces that follow a letter/digit/underscore by a comma
    s = re.sub(r"(\w)\s+", r"\1,", s)
    #print("debug", s)  # uncomment to see the temp string
    # put back the original strings
    s = re.sub(r"__token\d+__", lambda m: rep_dict[m.group(0)], s)
    return eval("dict({s})".format(s=s))
print(to_dict("k1='v1' k2='v2'"))
print(to_dict("s='1234' n=1234"))
print(to_dict(r"key='hello world'"))
print(to_dict('key="hello world"'))
print(to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""))
# extreme string test
print(to_dict(r"key='hello \'world\\'"))
prints:
{'k2': 'v2', 'k1': 'v1'}
{'n': 1234, 's': '1234'}
{'key': 'hello world'}
{'key': 'hello world'}
{'k5': {'k6': ['potato']}, 'k4': 'k5="hello"'}
{'key': "hello 'world\\"}
The key is to extract the strings (quoted/double quoted) using a non-greedy regex and replace them by non-strings (as if they were string variables, not literals) in the expression. The regex has been tuned so it can accept escaped quotes and a double escape at the end of a string (custom solution).
The replacement function is an inner function, so it can make use of the nonlocal dictionary & counter to track the replaced text, which can be restored once the spaces have been taken care of.
When replacing the spaces by commas, you have to be careful not to do it after a colon (see the last test): all things considered, it's only safe after an alphanumeric/underscore (hence the \w protection in the replacement regex for the comma).
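As a quick isolated check of that protection (illustrative only, applied to a raw value rather than the tokenized form): the space after the colon survives because : is not a \w character, while the space after the 1 becomes a comma:
>>> import re
>>> re.sub(r"(\w)\s+", r"\1,", "a=1 b={'k': [2]}")
"a=1,b={'k': [2]}"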
If we uncomment the debug print just before the original strings are put back, it prints:
debug k1=__token0__,k2=__token1__
debug s=__token0__,n=1234
debug key=__token0__
debug k4=__token0__,k5={__token1__: [__token2__]}
debug key=__token0__
The strings have been pwned, and the replacement of spaces has worked properly. With some more effort, it should probably be possible to quote the keys and replace k1= by "k1": so that ast.literal_eval could be used instead of eval (more risky, and not required here).
I'm sure some super-complex expressions can break my code (I've even heard that there are very few JSON parsers able to parse 100% of valid JSON files), but for the tests you submitted it'll work (of course, if some funny guy tries to put __tokenxx__ idents in the original strings, that'll fail; maybe they could be replaced by some otherwise invalid-as-variable placeholders). I built an Ada lexer using this technique some time ago to avoid dealing with spaces in strings, and it worked pretty well.
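For what it's worth, here is a minimal sketch of that ast.literal_eval variant (untested beyond the examples above, and assuming keys match \w+): quote the keys while the strings are still replaced by __tokenN__ idents, then parse a dict display. The final restore-and-eval step of to_dict would become:
    # quote the keys: k1= -> 'k1':  (safe here because any = inside strings is tokenized away)
    s = re.sub(r"(\w+)=", r"'\1': ", s)
    # put back the original strings
    s = re.sub(r"__token\d+__", lambda m: rep_dict[m.group(0)], s)
    return ast.literal_eval("{%s}" % s)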
You can find all the occurrences of = characters, and then find the maximum runs of characters that give a valid ast.literal_eval result. Those characters can then be parsed for the value, associated with a key found by a string slice between the last successful parse and the index of the current =:
import ast, typing

def is_valid(_str: str) -> bool:
    # True if ast.literal_eval can parse the string.
    try:
        _ = ast.literal_eval(_str)
    except:
        return False
    else:
        return True

def parse_line(_d: str) -> typing.Generator[typing.Tuple, None, None]:
    _eq, last = [i for i, a in enumerate(_d) if a == '='], 0
    for _loc in _eq:
        if _loc >= last:
            _key = _d[last:_loc]
            _inner, _running = _loc + 1, _loc + 2
            while True:
                try:
                    val = ast.literal_eval(_d[_inner:_running])
                except:
                    # not yet a valid literal; widen the window
                    _running += 1
                else:
                    # found a valid prefix; extend it to the longest valid run
                    _max = max([i for i in range(len(_d[_inner:]))
                                if is_valid(_d[_inner:_running + i])])
                    yield (_key, ast.literal_eval(_d[_inner:_running + _max]))
                    last = _running + _max
                    break

def to_dict(_d: str) -> dict:
    return dict(parse_line(_d))
print([to_dict("key='hello world'"),
       to_dict("k1='v1' k2='v2'"),
       to_dict("s='1234' n=1234"),
       to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""),
       to_dict("val=['100', 100, 300]"),
       to_dict("val=[{'t':{32:45}, 'stuff':100, 'extra':[]}, 100, 300]")
       ])
Output:
{'key': 'hello world'}
{'k1': 'v1', 'k2': 'v2'}
{'s': '1234', 'n': 1234}
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
{'val': ['100', 100, 300]}
{'val': [{'t': {32: 45}, 'stuff': 100, 'extra': []}, 100, 300]}
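Since parse_line is a generator, the intermediate key/value pairs can also be inspected directly before they are collected into a dict:
>>> list(parse_line("s='1234' n=1234"))
[('s', '1234'), ('n', 1234)]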
Disclaimer:
This solution is not as elegant as @Jean-FrançoisFabre's, and I am not sure if it can parse 100% of what is passed to to_dict, but it may give you inspiration for your own version.
popstr: split a thing that looks like a string off the start of the string.
If it starts with a single or double quote mark, I'll look for the next one and split at that point.
def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]
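For example (note that this simple version does not handle escaped quotes inside the string):
>>> popstr("'hello world' k2='v2'")
("'hello world'", " k2='v2'")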
poptrt: split a thing surrounded by brackets ('[]', '()', '{}') off the start of the string.
If it starts with a bracket, I'll increment a counter for every instance of the starting character and decrement it for every instance of its complement. When the counter reaches zero, I split.
def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            # pull any embedded string literal out via popstr so its
            # contents can't affect the bracket count
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')
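For example (only the outermost bracket type is counted; the quoted string inside is pulled out via popstr, so its contents can't affect the count):
>>> poptrt("{'k6': ['potato']} rest=1")
("{'k6': ['potato']}", ' rest=1')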
def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d
assert to_dict("key='hello world'") == {'key': 'hello world'}
assert to_dict("k1='v1' k2='v2'") == {'k1': 'v1', 'k2': 'v2'}
assert to_dict("s='1234' n=1234") == {'s': '1234', 'n': 1234}
assert to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""") == {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Putting it all together:
import ast

def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]

def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')

def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d