Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding combinations of stems and endings

Tags:

python

I have mappings of "stems" and "endings" (may not be the correct words) that look like so:

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

But much longer, of course. I also made the corresponding dictionary the other way around:

all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}

I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in

lockdown
lockup
crackdown
crackup

But not including 'upvote', 'downfall' or 'locksmith' (and it's this that causes me the biggest problems). I get false positives like:

pancake
cupcake
cupboard

But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction.

Confused and useless code so far, which you probably should just ignore:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)
like image 897
Lennart Regebro Avatar asked Jan 20 '11 16:01

Lennart Regebro


People also ask

Can affixes only attach to stems?

Affixes, unlike clitics, are categorially restrictive, i.e. they attach only to stems of a certain parts of speech. (English affix re- attaches only to verbs: re-use, but not numerals: *re-five.)

What is a Latin stem?

The stem is the part of the noun that the case endings are added to. It is the basic form of the word that appears in all case forms except the nominative singular of third declension nouns and a few second declension nouns (and the accusative singular, for third declension neuter nouns).

What is the difference between a root and a stem in linguistics?

A root differs partially from a stem in that a stem must have lexical meaning. A root has no lexical meaning and the semantic range of the root is vague if there is any at all. A stem may contain derivational affixes.

What is stem in linguistics?

In linguistics, a word stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question.


1 Answers

It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

Output:

$ python stems_and_endings.py 
2 matches found
lockdown
lockup
crackdown
crackup

Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if there are more than two matches. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.

like image 191
ire_and_curses Avatar answered Sep 23 '22 12:09

ire_and_curses