I'm working with a fairly simple parser in PLY, and one of my rules takes on the following form:
def p_things(p):
    '''
    things : thing things
    things : thing
    '''
    p[0] = [p[1]]
    if len(p) == 3:
        p[0] += p[2]
Input files are generally simple lists of things, so the parsing itself is not complex. Some of my input files are very large, however (exceeding 100,000 lines fairly regularly, and over 1,000,000 in extreme cases). In profiling (via cProfile and pstats), the bulk of the runtime is taken up by repeated calls to p_things - presumably, one call for each item in a things list.
Is there a way to reduce this time, or a more efficient way to structure this rule? Most answers I've seen so far (and the canonical compilers info I've found) have listed this method as the generally accepted way to construct a list of parseable items, no matter the length.
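For context, the profiling setup was roughly along these lines (a minimal sketch; the myparser module name and the input file name are placeholders, not from the original post):

import cProfile
import pstats

import myparser  # hypothetical module that builds the PLY lexer/parser

with open('input.things') as f:  # placeholder input file
    data = f.read()

profiler = cProfile.Profile()
profiler.enable()
myparser.parser.parse(data)  # the call whose time is dominated by p_things
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)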
It turns out I was forgetting some of my basic compilers theory. PLY builds an LALR(1) parser, and for a bottom-up parser it's better to write the rule left-recursively:
def p_things(p):
    '''
    things : things thing
    things : thing
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1]
        p[0].append(p[2])
Though it may look more verbose, there's actually a significant improvement. An LALR(1) parser can reduce a left-recursive list as each thing arrives, keeping the parse stack shallow, and the action above appends to the existing list instead of building a new one at every step. The right-recursive form, by contrast, only reduces after the whole input has been shifted, and its action copies the ever-growing tail list at every reduction, so the total work grows quadratically with list length. On my larger input files the runtime went from clearly superlinear to linear; one sample, with over a million items in the things list, ran in under 20% of the time.
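For anyone who wants to reproduce this, here is a minimal, self-contained sketch of a PLY grammar built around the left-recursive rule. The THING token, its regex, and the generated test input are assumptions for illustration, not the grammar from the question:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('THING',)

t_THING = r'\w+'     # assumed: any word counts as a thing
t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

def p_things(p):
    '''
    things : things thing
    things : thing
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1]
        p[0].append(p[2])  # append in place; the list built so far is never copied

def p_thing(p):
    'thing : THING'
    p[0] = p[1]

def p_error(p):
    print('Syntax error at', p)

lexer = lex.lex()
parser = yacc.yacc(debug=False, write_tables=False)

if __name__ == '__main__':
    data = ' '.join('item%d' % i for i in range(1000000))
    result = parser.parse(data, lexer=lexer)
    print(len(result))  # 1000000

Swapping the docstring back to the right-recursive form (and rebuilding the list in the action, as in the original rule) is an easy way to see the difference on a large input.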