I have mappings of "stems" and "endings" (may not be the correct words) that look like so:
all_endings = {
'birth': set(['place', 'day', 'mark']),
'snow': set(['plow', 'storm', 'flake', 'man']),
'shoe': set(['lace', 'string', 'maker']),
'lock': set(['down', 'up', 'smith']),
'crack': set(['down', 'up',]),
'arm': set(['chair']),
'high': set(['chair']),
'over': set(['charge']),
'under': set(['charge']),
}
But much longer, of course. I also made the corresponding dictionary the other way around:
all_stems = {
'chair': set(['high', 'arm']),
'charge': set(['over', 'under']),
'up': set(['lock', 'crack', 'vote']),
'down': set(['lock', 'crack', 'fall']),
'smith': set(['lock']),
'place': set(['birth']),
'day': set(['birth']),
'mark': set(['birth']),
'plow': set(['snow']),
'storm': set(['snow']),
'flake': set(['snow']),
'man': set(['snow']),
'lace': set(['shoe']),
'string': set(['shoe']),
'maker': set(['shoe']),
}
I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in
lockdown
lockup
crackdown
crackup
But not including 'upvote', 'downfall' or 'locksmith'
(and it's this that causes me the biggest problems). I get false positives like:
pancake
cupcake
cupboard
But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction.
Confused and useless code so far, which you probably should just ignore:
findings = defaultdict(set)
for stem, endings in all_endings.items():
# What stems have matching endings:
for ending in endings:
otherstems = all_stems[ending]
if not otherstems:
continue
for otherstem in otherstems:
# Find endings that also exist for other stems
otherendings = all_endings[otherstem].intersection(endings)
if otherendings:
# Some kind of match
findings[stem].add(otherstem)
# Go through this in order of what is the most stems that match:
MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
for this_stem in match:
other_stems = set() # Stems that have endings in common with this_stem
other_endings = set() # Endings this stem have in common with other stems
this_endings = all_endings[this_stem]
for this_ending in this_endings:
for other_stem in all_stems[this_ending] - set([this_stem]):
matching_endings = this_endings.intersection(all_endings[other_stem])
if matching_endings:
other_endings.add(this_ending)
other_stems.add(other_stem)
stem_matches = all_stems[other_endings.pop()]
for other in other_endings:
stem_matches = stem_matches.intersection(all_stems[other])
if len(stem_matches) >= MINMATCH:
for m in stem_matches:
for e in all_endings[m]:
print(m+e)
Affixes, unlike clitics, are categorially restrictive, i.e. they attach only to stems of a certain parts of speech. (English affix re- attaches only to verbs: re-use, but not numerals: *re-five.)
The stem is the part of the noun that the case endings are added to. It is the basic form of the word that appears in all case forms except the nominative singular of third declension nouns and a few second declension nouns (and the accusative singular, for third declension neuter nouns).
A root differs partially from a stem in that a stem must have lexical meaning. A root has no lexical meaning and the semantic range of the root is vague if there is any at all. A stem may contain derivational affixes.
In linguistics, a word stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question.
It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:
all_stems = {
'chair' : set(['high', 'arm']),
'charge': set(['over', 'under']),
'fall' : set(['down', 'water', 'night']),
'up' : set(['lock', 'crack', 'vote']),
'down' : set(['lock', 'crack', 'fall']),
}
endings = all_stems.keys()
stem_sets = all_stems.values()
i = 0
for target_stem_set in stem_sets:
i += 1
j = 0
remaining_stems = stem_sets[i:]
for remaining_stem_set in remaining_stems:
j += 1
union = target_stem_set & remaining_stem_set
if len(union) > 1:
print "%d matches found" % len(union)
for stem in union:
print "%s%s" % (stem, endings[i-1])
print "%s%s" % (stem, endings[j+i-1])
Output:
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup
Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if there are more than two matches. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With