Parsing words into (prefix, root, suffix) in Python

Tags: python, parsing, nlp

I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)

Basically, I have a limited number of prefixes, each one or two letters long, and a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever is between them should be the "root" of the word. Many words will have more than one possible parsing, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).

I can't figure out how to structure the code though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).

Any thoughts?

prefixes = ['de','con']
suffixes = ['er','s']

def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses



>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...   parses = parser(w)
...   print(w)
...   for p in parses:
...     print(p)
... 
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')


asked Apr 14 '12 19:04

larapsodia

2 Answers

Pyparsing wraps the string indexing and token extraction in its own parsing framework, and lets you build up parser definitions with ordinary operators such as + (sequence) and | (alternative):

from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo

wordlist = ['construct','destructer','constructs','deconstructs']

endOfString = StringEnd()
prefix = oneOf("de con")
# a suffix only counts if it sits at the very end of the word
suffix = oneOf("er s") + FollowedBy(endOfString)

# any number of prefixes, then the root up to a final suffix (or the end of the word)
word = (ZeroOrMore(prefix)("prefixes") +
        SkipTo(suffix | endOfString)("root") +
        Optional(suffix)("suffix"))

for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())
    print(res.prefixes)
    print(res.root)
    print(res.suffix)
    print()

The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output from this program is:

construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct


destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']

constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']

deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
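
For comparison, roughly the same grammar can be written with the standard library's re module instead of pyparsing. This is only a minimal sketch under stated assumptions: the names alternation, WORD_RE and parse_with_regex are hypothetical, and the pattern assumes, like the grammar above, any number of prefixes, a non-empty root, and at most one suffix at the very end of the word.

```python
import re

prefixes = ['de', 'con']
suffixes = ['er', 's']

def alternation(affixes):
    # Hypothetical helper: build "con|de"-style alternatives, longest first,
    # escaping each affix so it is matched literally.
    ordered = sorted(affixes, key=len, reverse=True)
    return '|'.join(re.escape(a) for a in ordered)

# Greedy run of prefixes, a lazy non-empty root, and one optional suffix.
WORD_RE = re.compile(
    r'(?P<prefixes>(?:%s)*)(?P<root>.+?)(?P<suffix>%s)?'
    % (alternation(prefixes), alternation(suffixes))
)

def parse_with_regex(word):
    # fullmatch forces the pattern to cover the whole word, so the optional
    # suffix can only be taken from the end.
    m = WORD_RE.fullmatch(word)
    if m is None:
        return None
    # Split the run of prefixes back into the individual prefixes.
    prefix_list = re.findall(alternation(prefixes), m.group('prefixes'))
    return prefix_list, m.group('root'), m.group('suffix') or ''

for wd in ['construct', 'destructer', 'constructs', 'deconstructs']:
    print(wd, parse_with_regex(wd))
```

Like the pyparsing version, this returns a single parse per word. The lazy root makes the regex prefer the shortest root that still lets the rest of the word match.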


answered Sep 20 '22 11:09

PaulMcG

Here is my solution:

prefixes = ['de','con']
suffixes = ['er','s']

def parse(word):
    prefix = ''
    suffix = ''

    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True

    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True

    return (prefix, word, suffix)

print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))

Result:

>>> 
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')

The idea is to loop over all the prefixes, strip each one that matches from the front of the word, and accumulate what was stripped. The tricky part is that a single pass can miss a prefix, depending on the order in which the prefixes are listed: a prefix checked early may only become the start of the word after a later one has been removed. So the pass is repeated until a full pass finds nothing.
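
To see why the repeated pass matters, here is a small sketch that compares one pass over the prefix list with the repeat-until-nothing-found loop. The helper names strip_prefixes_once and strip_prefixes_repeated are made up for this illustration. They return the prefixes as a list rather than a joined string.

```python
prefixes = ['de', 'con']

def strip_prefixes_once(word):
    # One pass over the list: 'de' is checked before 'con', so a 'de'
    # that is uncovered by removing 'con' is never seen.
    found = []
    for p in prefixes:
        if word.startswith(p):
            found.append(p)
            word = word[len(p):]
    return found, word

def strip_prefixes_repeated(word):
    # Repeat the pass until a full pass strips nothing.
    found = []
    changed = True
    while changed:
        changed = False
        for p in prefixes:
            if word.startswith(p):
                found.append(p)
                word = word[len(p):]
                changed = True
    return found, word

print(strip_prefixes_once('condestruct'))      # (['con'], 'destruct')
print(strip_prefixes_repeated('condestruct'))  # (['con', 'de'], 'struct')
```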

The same goes for suffixes, except that each newly found suffix is prepended to the ones already found, because they are discovered from the end of the word inward.
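
The loop above is greedy and returns one parse per word. If every possible parse is wanted, as the question asks, one option is to enumerate each way of peeling zero or more prefixes and zero or more suffixes. The following is a minimal sketch, not a definitive implementation: affix_splits and all_parses are hypothetical names, prefixes and suffixes are returned as tuples to keep their boundaries visible, and the min_root parameter (an assumption, defaulting to 1) rejects parses whose root would be too short.

```python
prefixes = ['de', 'con']
suffixes = ['er', 's']

def affix_splits(word, affixes, from_end=False):
    """Yield (affix_chain, remainder) for every way of peeling zero or
    more affixes from the start (or, with from_end=True, the end) of word."""
    yield (), word  # peeling nothing is always an option
    for a in affixes:
        if from_end and word.endswith(a):
            for chain, rest in affix_splits(word[:-len(a)], affixes, True):
                yield chain + (a,), rest   # keep suffixes in word order
        elif not from_end and word.startswith(a):
            for chain, rest in affix_splits(word[len(a):], affixes, False):
                yield (a,) + chain, rest   # keep prefixes in word order

def all_parses(word, min_root=1):
    """Return every (prefixes, root, suffixes) split with a long enough root."""
    parses = []
    for pre, rest in affix_splits(word, prefixes):
        for suf, root in affix_splits(rest, suffixes, from_end=True):
            if len(root) >= min_root:
                parses.append((pre, root, suf))
    return parses

for p in all_parses('deconstructs'):
    print(p)
```

For 'deconstructs' this yields six candidates, ranging from ((), 'deconstructs', ()) to (('de', 'con'), 'struct', ('s',)). Choosing among them would need extra knowledge, such as a list of known roots.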

answered Sep 19 '22 11:09

Israel Unterman
