I wrote a parser in python3.6; I simplified it as much as possible while still producing the bug:
def tokenize(expr):
    for i in expr:
        try:
            yield int(i)
        except ValueError:
            yield i

def push_on_stream(obj, stream):
    yield obj
    yield from stream

class OpenBracket:
    "just a token value, could have used Ellipsis"
    pass

def parse_toks(tokstream):
    result = []
    leading_brak = False
    for tok in tokstream:
        if tok == OpenBracket:
            leading_brak = True
        elif tok == '(':
            result.append(parse_toks(
                push_on_stream(OpenBracket, tokstream)))
        elif tok == ')':
            if not leading_brak:
                raise SyntaxError("Very bad ')'.")
            break
        else:
            result.append(tok)
    return sum(result)

def test(expr="12(34)21"):
    tokens = tokenize(expr)
    print(parse_toks(tokens))
    print(list(tokens))

test()
This example is trivial; the intended effect is to add up all the digits in a string, including digits inside brackets.
A tokenize() function yields tokens, and a parse_toks() function parses the token stream. If it comes across an open parenthesis, it recurses (pushing OpenBracket onto the token stream), which should have the effect of treating the digits inside the parentheses as a separate expression, parsing it and appending the result to the result list.
When I run the parser, e.g. on the expression "1(2)3", it immediately ends after the close bracket, returning 3, and in fact the token stream seems to have ended.
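For concreteness, here is what I see on CPython (the expected results would be 13 and 6 respectively):

>>> test()
10
[]
>>> test("1(2)3")
3
[]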
When I run it under pdb, however, and set breakpoints inside the loop in parse_toks, I can step carefully through the processing of the ')', and the program correctly returns 6.
I think the bug is something to do with yielding from the token stream in push_on_stream().
Is this a bug in the interpreter? If so is there a good workaround?
I wrote it for python-3.6, but I also tested it on python-3.7 on a different machine with the same result.
Your push_on_stream doesn't quite work the way you think it should. See, when the push_on_stream generator is reclaimed, Python calls close() on the generator, which throws a GeneratorExit into the generator to make sure any finally blocks and __exit__ methods run. Since push_on_stream uses yield from on the underlying generator, if push_on_stream is suspended in the yield from when this happens, the GeneratorExit is propagated into the underlying tokenize generator as well.
This immediately terminates the token stream. In pdb, something (most likely a reference held by the debugger) caused the push_on_stream generator not to be collected, preventing this effect.
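You can see the delegation in isolation with a small sketch (the names here are illustrative, not part of the parser): closing a generator that is suspended inside a yield from also closes the generator it delegates to:

def countdown(n):
    "stand-in for tokenize()"
    while n:
        yield n
        n -= 1

inner = countdown(5)
outer = push_on_stream('x', inner)
print(next(outer))  # 'x'
print(next(outer))  # 5 -- outer is now suspended inside "yield from stream"
outer.close()       # throws GeneratorExit into outer...
print(list(inner))  # [] -- ...which closed inner along with it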
When the break statement leaves the loop, the abandoned push_on_stream generator is reclaimed (on CPython this happens as soon as the recursive parse_toks call returns, since nothing else references it), and a GeneratorExit exception is raised which propagates through the generators. pdb modifies how this propagates, which is exactly the sort of subtle bug I'd expect it to introduce, causing it not to close the generator that push_on_stream is yielding from.
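The same thing can be reproduced without pdb in a stripped-down form (assuming CPython, where reference counting reclaims the wrapper as soon as the last reference to it is dropped):

def tokens():
    yield from "ab)cd"

stream = tokens()
wrapper = push_on_stream('(', stream)
for tok in wrapper:
    if tok == ')':
        break        # wrapper is abandoned while suspended in "yield from stream"
del wrapper          # reclaimed: close() runs, GeneratorExit propagates into stream
print(list(stream))  # [] -- 'c' and 'd' are gone with it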
If we change push_on_stream from:

def push_on_stream(obj, stream):
    yield obj
    yield from stream

to:

def push_on_stream(obj, stream):
    yield obj
    stream = iter(stream)
    while True:
        # under PEP 479 (the default from Python 3.7), a StopIteration
        # escaping from next() would become a RuntimeError, so catch it
        # and end the generator explicitly
        try:
            yield next(stream)
        except StopIteration:
            return

then this guarantees the correct behaviour in both cases: a plain yield does not delegate close() to the stream, so reclaiming the wrapper no longer kills the underlying tokenize generator.
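With that change in place, the test case from the question behaves as intended:

>>> test()
13
[]
>>> test("1(2)3")
6
[]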
Bug fixed!
Provided better by user2357112's answer. Basically, yield from doesn't work the way you'd think it does: when the wrapper generator is reclaimed after the break statement, yield from causes the generator you're iterating over to be closed and mark itself as exhausted. (pdb interrupts this, because it's a slightly buggy pain.) This leads to the parser terminating at the first ')', because the underlying iterator is stopped when the first break statement runs.
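For what it's worth, a plain for loop over the stream has the same effect as the while/next version, since only yield from delegates close() to the inner iterator (this is just an equivalent workaround, not the only one):

def push_on_stream(obj, stream):
    yield obj
    for item in stream:
        # a plain yield does not forward GeneratorExit to "stream",
        # so reclaiming this generator leaves the stream alive
        yield item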