I have a string in python of the following form:
line a
line b
line ba
line bb
line bba
line bc
line c
line ca
line caa
line d
You can get the idea. It actually takes a very similar form to python code itself, in that there is a line, and below that line, indentations indicate part of a block, headed by the most recent line of a lesser indentation.
What I need to do is parse this code into a tree sructure, such that each root level line is the key of a dictionary, and its value is a dictionary representing all sublines. so the above would be:
{
'line a' => {},
'line b' => {
'line ba' => {},
'line bb' => {
'line bba' => {}
},
'line bc' => {}
},
'line c' => {
'line ca' => {
'line caa' => {}
},
},
'line d' => {}
}
here's what I've got:
def parse_message_to_tree(message):
buf = StringIO(message)
return parse_message_to_tree_helper(buf, 0)
def parse_message_to_tree_helper(buf, prev):
ret = {}
for line in buf:
line = line.rstrip()
index = len(line) - len(line.lstrip())
print (line + " => " + str(index))
if index > prev:
ret[line.strip()] = parse_message_to_tree_helper(buf, index)
else:
ret[line.strip()] = {}
return ret
The print shows lines that are left stripped and indexes of 0. I didn't think lstrip()
was a mutator, but either way the index should still be accurate.
Any suggestions are helpful.
EDIT: Not sure what was going wrong before, but I tried again and it is closer to working, but still not quite right. Here's what I have now:
{'line a': {},
'line b': {},
'line ba': {'line bb': {},
'line bba': {'line bc': {},
'line c': {},
'line ca': {},
'line caa': {},
'line d': {}}}}
Like already noted before str.lstrip()
is not a mutator, The index is coming accurate in my system as well.
But the issue is that by the time you realize that the index for the line has increased, line
is actually point to the increased index line , example , in first case, we note that index for line increases at line ba
, so line
points to line ba
, and then in your if
condition , you do -
ret[line.strip()] = parse_message_to_tree_helper(buf, index)
This is wrong, because you would be setting whatever is returned by parse_message_to_tree_helper()
to line ba
, not its actual parent.
Also, once you recurse inside the function, you do not come out unless the file has completely been read, but the level in which a certain line is stored in dictionary depends on it coming out of recursion when the indentation has decreased.
I am not sure, if there are any inbuilt libraries that will help you do this , but a code that I was able to come up with (based a lot on your code) -
def parse_message_to_tree(message):
buf = StringIO(message)
return parse_message_to_tree_helper(buf, 0, None)[0]
def parse_message_to_tree_helper(buf, prev, prevline):
ret = {}
index = -1
for line in buf:
line = line.rstrip()
index = len(line) - len(line.lstrip())
print (line + " => " + str(index))
if index > prev:
ret[prevline.strip()],prevline,index = parse_message_to_tree_helper(buf, index, line)
if index < prev:
return ret,prevline,index
continue
elif not prevline:
ret[line.strip()] = {}
else:
ret[prevline.strip()] = {}
if index < prev:
return ret,line,index
prevline = line
if index == -1:
ret[prevline.strip()] = {}
return ret,None,index
if prev == index:
ret[prevline.strip()] = {}
return ret,None,0
Example/Demo -
>>> print(s)
line a
line b
line ba
line bb
line bba
line bc
line c
line ca
line caa
>>> def parse_message_to_tree(message):
... buf = StringIO(message)
... return parse_message_to_tree_helper(buf, 0, None)[0]
...
>>> def parse_message_to_tree_helper(buf, prev, prevline):
... ret = {}
... index = -1
... for line in buf:
... line = line.rstrip()
... index = len(line) - len(line.lstrip())
... print (line + " => " + str(index))
... if index > prev:
... ret[prevline.strip()],prevline,index = parse_message_to_tree_helper(buf, index, line)
... if index < prev:
... return ret,prevline,index
... continue
... elif not prevline:
... ret[line.strip()] = {}
... else:
... ret[prevline.strip()] = {}
... if index < prev:
... return ret,line,index
... prevline = line
... if index == -1:
... ret[prevline.strip()] = {}
... return ret,None,index
... if prev == index:
... ret[prevline.strip()] = {}
... return ret,None,0
...
>>> pprint.pprint(parse_message_to_tree(s))
line a => 0
line b => 0
line ba => 2
line bb => 2
line bba => 4
line bc => 2
line c => 0
line ca => 2
line caa => 4
{'line a': {},
'line b': {'line ba': {}, 'line bb': {'line bba': {}}, 'line bc': {}},
'line c': {'line ca': {'line caa': {}}}}
>>> s = """line a
... line b
... line ba
... line bb
... line bba
... line bc
... line c
... line ca
... line caa
... line d"""
>>> pprint.pprint(parse_message_to_tree(s))
line a => 0
line b => 0
line ba => 2
line bb => 2
line bba => 4
line bc => 2
line c => 0
line ca => 2
line caa => 4
line d => 0
{'line a': {},
'line b': {'line ba': {}, 'line bb': {'line bba': {}}, 'line bc': {}},
'line c': {'line ca': {'line caa': {}}},
'line d': {}}
You would need to test the code for any more bugs or some missed cases.
lstrip()
isn't a mutator, see documentation:
string.lstrip(s[, chars])
Return a copy of the string with leading characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the beginning of the string this method is called on.
And your code seems to work with that sample text on my machine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With