How best to parse a simple grammar?

Tags:

Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.

I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.

From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)

Some sample inputs and outputs:

"CS 2110" => ("CS", 2110) # 0  "CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1  "CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2  "MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3

If the entire description is just a course, it is output directly.
If the courses are conjoined ("and"), they are all output in the same list
If the course are disjoined ("or"), they are in separate lists
Here, we have both "and" and "or".

One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.

What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:

def p_course(p):  'course : DEPT_CODE COURSE_NUMBER'  p[0] = (p[1], int(p[2]))

With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.

 def addCourse(self, str, location, tokens):   self.result.append((tokens[0][0], tokens[0][1]))   def makeCourseList(self, str, location, tokens):    dept = tokens[0][0]   new_tokens = [(dept, tokens[0][1])]   new_tokens.extend((dept, tok) for tok in tokens[1:])    self.result.append(new_tokens)

For instance, to handle "or" cases:

    def __init__(self):             self.result = []             # ...   self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses)     def disjunctionCourses(self, str, location, tokens):   if len(tokens) == 1:    return tokens    print "disjunction tokens: %s" % tokens

How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted...

Also, there are many descriptions that include misc text, like:

"CS 2110 or permission of instructor" "INFO 3140 or equivalent experience" "PYSCH 2210 and sophomore standing"

But it isn't critical that I parse that text.

What's a better way to approach this problem?

347

asked May 31 '10 18:05

Nick Heiner

1 Answers

def parse(astr):     astr=astr.replace(',','')     astr=astr.replace('and','')         tokens=astr.split()     dept=None     number=None     result=[]     option=[]     for tok in tokens:         if tok=='or':             result.append(option)             option=[]             continue         if tok.isalpha():             dept=tok             number=None         else:             number=int(tok)         if dept and number:             option.append((dept,number))     else:         if option:             result.append(option)     return result  if __name__=='__main__':     tests=[ ("CS 2110" , [[("CS", 2110)]]),             ("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),             ("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),             ("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),             ("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),             ("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]      for test,answer in tests:         result=parse(test)         if result==answer:             print('GOOD: {0} => {1}'.format(test,answer))         else:             print('ERROR: {0} => {1} != {2}'.format(test,result,answer))             break

yields

GOOD: CS 2110 => [[('CS', 2110)]] GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]] GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]] GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]] GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]] GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]

181

answered Oct 12 '22 07:10

unutbu

Related questions
                            
                                conda install python=3.6 UnsatisfiableError
                            
                                Can I test AWS Glue code locally?
                            
                                Solving "500: Internal Server Error, nbconvert failed: xelatex not found in PATH"
                            
                                Why don't Python sets preserve insertion order?
                            
                                Emacs and Python
                            
                                What in the world is the attribute "__class__" in python
                            
                                What is the hard recursion limit for Linux, Mac and Windows?
                            
                                What does this "-" in jinja2 template engine do?
                            
                                How can I compare a unicode type to a string in python?
                            
                                Difference between python - getmtime() and getctime() in unix system
                            
                                Pandas groupby(),agg() - how to return results without the multi index?
                            
                                Pythonic way to compose context managers for objects owned by a class
                            
                                Nested inlines in the Django admin?
                            
                                How to handle IncompleteRead: in python
                            
                                In Scipy how and why does curve_fit calculate the covariance of the parameter estimates
                            
                                shebang: use interpreter relative to the script path
                            
                                sklearn agglomerative clustering linkage matrix
                            
                                PIL cannot identify image file for io.BytesIO object
                            
                                What is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit learn?
                            
                                blocks - send input to python subprocess pipeline

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How best to parse a simple grammar?

Tags:

python

parsing

nlp

pyparsing

ply

Nick Heiner

People also ask

1 Answers

unutbu

Recent Activity

Donate For Us