 

Parsing words into (prefix, root, suffix) in Python

I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)

Basically, I have a limited number of prefixes, each one or two letters long; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever is in between them should be the "root" of the word. Many words will have more than one possible parsing, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).

I can't figure out how to structure the code though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).

Any thoughts?

prefixes = ['de','con']
suffixes = ['er','s']

def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses



>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...   parses = parser(w)
...   print(w)
...   for p in parses:
...     print(p)
... 
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')
asked Apr 14 '12 19:04 by larapsodia




2 Answers

Pyparsing wraps the string indexing and token extraction in its own parsing framework, and lets you build up your parsing definitions with simple arithmetic-style operators such as + and |:

wordlist = ['construct','destructer','constructs','deconstructs']

from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo

endOfString = StringEnd()
prefix = oneOf("de con")
suffix = oneOf("er s") + FollowedBy(endOfString)

word = (ZeroOrMore(prefix)("prefixes") + 
        SkipTo(suffix | endOfString)("root") + 
        Optional(suffix)("suffix"))

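for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())      # all tokens plus the named results
    print(res.prefixes)    # list of matched prefixes
    print(res.root)        # text between the prefixes and the suffix
    print(res.suffix)      # empty if the word has no suffix
    print()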

The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output from this program is:

construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct


destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']

constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']

deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
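
For readers who do not have pyparsing installed, roughly the same grammar (zero or more prefixes, a root, at most one suffix at the end of the word) can be sketched with the standard `re` module. This is a different technique from the pyparsing code above, not a translation of it; the names `WORD_RE`, `PREFIX_RE` and `split_word` are made up for this illustration, and the sketch assumes no prefix is itself the start of another prefix.

```python
import re

PREFIXES = ["de", "con"]
SUFFIXES = ["er", "s"]

# Build alternations from the affix lists; re.escape guards against
# affixes that happen to contain regex metacharacters.
prefix_alt = "|".join(map(re.escape, PREFIXES))
suffix_alt = "|".join(map(re.escape, SUFFIXES))

# Zero or more prefixes, a lazily matched root, then an optional suffix
# that must sit at the very end of the word.
WORD_RE = re.compile(r"^((?:%s)*)(.*?)(%s)?$" % (prefix_alt, suffix_alt))
PREFIX_RE = re.compile(prefix_alt)


def split_word(word):
    """Return (list_of_prefixes, root, suffix) for a single word."""
    m = WORD_RE.match(word)
    prefix_run, root, suffix = m.groups()
    # Split the run of prefixes ("decon") back into its parts.
    return PREFIX_RE.findall(prefix_run), root, suffix or ""


for wd in ["construct", "destructer", "constructs", "deconstructs"]:
    print(wd, split_word(wd))
```

For `'deconstructs'` this gives `(['de', 'con'], 'struct', 's')`. Like the pyparsing version, it returns a single parse per word rather than every possible one.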
answered Sep 20 '22 11:09 by PaulMcG


Here is my solution:

prefixes = ['de','con']
suffixes = ['er','s']

def parse(word):
    prefix = ''
    suffix = ''

    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True

    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True

    return (prefix, word, suffix)

print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))

Result:

('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')

The idea is to loop through all the prefixes, accumulating each one that matches and stripping it from the word as we go. The tricky part is that the order in which the prefixes are listed can cause a single pass to miss some of them, so the pass has to be repeated until no more prefixes are found.
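
To make the ordering issue concrete, here is a small sketch that compares one pass over the prefix list with the repeated pass. The helper names `strip_prefixes_once` and `strip_prefixes_repeated` are invented for this illustration and are not part of the solution above; the prefix list is deliberately given as `['con', 'de']` so that a single pass misses `'con'`.

```python
def strip_prefixes_once(word, prefixes):
    """Make a single pass over the prefix list, with no retry."""
    found = ""
    for p in prefixes:
        if word.startswith(p):
            found += p
            word = word[len(p):]  # remove the matched prefix
    return found, word


def strip_prefixes_repeated(word, prefixes):
    """Repeat the pass until a whole pass finds nothing new."""
    found = ""
    while True:
        step, word = strip_prefixes_once(word, prefixes)
        if not step:
            return found, word
        found += step


# 'con' is tested before 'de', so the single pass has already moved
# past 'con' by the time 'de' has been stripped.
print(strip_prefixes_once("deconstruct", ["con", "de"]))      # ('de', 'construct')
print(strip_prefixes_repeated("deconstruct", ["con", "de"]))  # ('decon', 'struct')
```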

The same goes for suffixes, except that the suffix string is built in reverse order, by putting each newly found suffix in front of the ones found so far.
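
Note that the loop above is greedy and returns a single parse, whereas the question asks for a list of all possible parses. One way that might be done is to enumerate every way of peeling affixes off each end. The following is only a sketch: `splits`, `all_parses` and the `min_root` parameter are hypothetical names introduced here, and it assumes that no affix is the empty string.

```python
def splits(word, affixes, from_start):
    """Yield (affix_string, remainder) for every way of peeling zero or
    more affixes off one end of word."""
    yield "", word
    for a in affixes:
        if from_start and word.startswith(a):
            for more, rest in splits(word[len(a):], affixes, True):
                yield a + more, rest
        elif not from_start and word.endswith(a):
            for more, rest in splits(word[:-len(a)], affixes, False):
                yield more + a, rest


def all_parses(word, prefixes, suffixes, min_root=1):
    """Return every (prefix, root, suffix) tuple whose root has at
    least min_root characters (min_root is an assumed tuning knob)."""
    parses = set()  # a set removes duplicates from overlapping affixes
    for pre, rest in splits(word, prefixes, True):
        for suf, root in splits(rest, suffixes, False):
            if len(root) >= min_root:
                parses.add((pre, root, suf))
    return sorted(parses)


for p in all_parses("deconstructs", ["de", "con"], ["er", "s"]):
    print(p)
```

For `'deconstructs'` this prints six tuples, from `('', 'deconstruct', 's')` up to `('decon', 'struct', 's')`. As in the greedy version, multiple prefixes are concatenated into one string.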

answered Sep 19 '22 11:09 by Israel Unterman