I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)
Basically, I have a limited number of prefixes, which can be either one or two letters; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever's in between them should be the "root" of the word. Many words will have more than one possible parsing, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).
I can't figure out how to structure the code though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).
Any thoughts?
prefixes = ['de','con']
suffixes = ['er','s']
def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses
>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...     parses = parser(w)
...     print(w)
...     for p in parses:
...         print(p)
...
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')
On Python 3.9 and later, you can use the str.removeprefix() and str.removesuffix() methods to remove a known prefix or suffix from a string.
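A minimal sketch of that approach, assuming Python 3.9 or later. The helper name `strip_affixes` is made up for illustration, and it only removes at most one prefix and one suffix, so it does not handle stacked prefixes or return alternative parses:

```python
prefixes = ['de', 'con']
suffixes = ['er', 's']

def strip_affixes(word):
    """Remove at most one known prefix and one known suffix (hypothetical helper)."""
    # Pick the first listed prefix that matches, or '' if none does.
    found_prefix = next((p for p in prefixes if word.startswith(p)), '')
    word = word.removeprefix(found_prefix)  # no-op when found_prefix is ''
    # Same idea for the suffix, applied to what is left of the word.
    found_suffix = next((s for s in suffixes if word.endswith(s)), '')
    word = word.removesuffix(found_suffix)  # no-op when found_suffix is ''
    return (found_prefix, word, found_suffix)

print(strip_affixes('destructer'))  # ('de', 'struct', 'er')
```

Both methods return the string unchanged when the affix is absent, so no extra check is needed before calling them.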
Pyparsing wraps the string indexing and token extracting into its own parsing framework, and allows you to use simple arithmetic syntax to build up your parsing definitions:
wordlist = ['construct','destructer','constructs','deconstructs']
from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo
endOfString = StringEnd()
prefix = oneOf("de con")
suffix = oneOf("er s") + FollowedBy(endOfString)
word = (ZeroOrMore(prefix)("prefixes") +
        SkipTo(suffix | endOfString)("root") +
        Optional(suffix)("suffix"))
for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())
    print(res.prefixes)
    print(res.root)
    print(res.suffix)
    print()
The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output from this program is:
construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct
destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']
constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']
deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
Here is my solution:
prefixes = ['de','con']
suffixes = ['er','s']
def parse(word):
    prefix = ''
    suffix = ''
    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True
    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True
    return (prefix, word, suffix)
print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))
Result:
>>>
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')
The idea is to loop through all the prefixes, accumulating each one that matches and removing it from the word as we go. The tricky part is that, depending on the order in which the prefixes are listed, a single pass can miss a prefix, so the loop has to be repeated until a full pass finds no more prefixes.
The same goes for suffixes, except that we build the suffix word in reverse order.
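Note that this loop is greedy and returns a single parse. If you want the list of every possible parse, as the question asks, one option is to branch at each step instead of always stripping. The following is only a sketch of that idea, not part of the solution above: the function names are invented for illustration, and it assumes the root must be non-empty.

```python
prefixes = ['de', 'con']
suffixes = ['er', 's']

def prefix_splits(word):
    """Yield (prefix_list, rest) for every way of peeling known prefixes off the front."""
    yield [], word  # option 1: stop stripping here
    for p in prefixes:
        # option 2: strip p, but only if something is left over
        if word.startswith(p) and len(word) > len(p):
            for more, rest in prefix_splits(word[len(p):]):
                yield [p] + more, rest

def suffix_splits(word):
    """Yield (rest, suffix_list) for every way of peeling known suffixes off the end."""
    yield word, []
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            for rest, more in suffix_splits(word[:-len(s)]):
                yield rest, more + [s]  # inner suffixes come before outer ones

def all_parses(word):
    """Return every (prefix, root, suffix) reading of word with a non-empty root."""
    parses = []
    for pre, rest in prefix_splits(word):
        for root, suf in suffix_splits(rest):
            parses.append((''.join(pre), root, ''.join(suf)))
    return parses

for parse_result in all_parses('deconstructs'):
    print(parse_result)
```

For 'deconstructs' this produces six candidates, from ('', 'deconstructs', '') through ('decon', 'struct', 's'); choosing among them (for example with a list of known roots) is left to the caller.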