I have mappings of "stems" and "endings" (may not be the correct words) that look like so: <pre class="prettyprint"><code>all_endings = { 'birth': set(['place', 'day', 'mark']), 'snow': set(['plow', 'storm', 'flake', 'man']), 'shoe': set(['lace', 'string', 'maker']), 'lock': set(['down', 'up', 'smith']), 'crack': set(['down', 'up',]), 'arm': set(['chair']), 'high': set(['chair']), 'over': set(['charge']), 'under': set(['charge']), } </code></pre> But much longer, of course. I also made the corresponding dictionary the other way around: <pre class="prettyprint"><code>all_stems = { 'chair': set(['high', 'arm']), 'charge': set(['over', 'under']), 'up': set(['lock', 'crack', 'vote']), 'down': set(['lock', 'crack', 'fall']), 'smith': set(['lock']), 'place': set(['birth']), 'day': set(['birth']), 'mark': set(['birth']), 'plow': set(['snow']), 'storm': set(['snow']), 'flake': set(['snow']), 'man': set(['snow']), 'lace': set(['shoe']), 'string': set(['shoe']), 'maker': set(['shoe']), } </code></pre> I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in <pre class="prettyprint"><code>lockdown lockup crackdown crackup </code></pre> But not including <code>'upvote', 'downfall' or 'locksmith'</code> (and it's this that causes me the biggest problems). I get false positives like: <pre class="prettyprint"><code>pancake cupcake cupboard </code></pre> But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction. Confused and useless code so far, which you probably should just ignore: <pre class="prettyprint"><code>findings = defaultdict(set) for stem, endings in all_endings.items(): # What stems have matching endings: for ending in endings: otherstems = all_stems[ending] if not otherstems: continue for otherstem in otherstems: # Find endings that also exist for other stems otherendings = all_endings[otherstem].intersection(endings) if otherendings: # Some kind of match findings[stem].add(otherstem) # Go through this in order of what is the most stems that match: MINMATCH = 2 for match in sorted(findings.values(), key=len, reverse=True): for this_stem in match: other_stems = set() # Stems that have endings in common with this_stem other_endings = set() # Endings this stem have in common with other stems this_endings = all_endings[this_stem] for this_ending in this_endings: for other_stem in all_stems[this_ending] - set([this_stem]): matching_endings = this_endings.intersection(all_endings[other_stem]) if matching_endings: other_endings.add(this_ending) other_stems.add(other_stem) stem_matches = all_stems[other_endings.pop()] for other in other_endings: stem_matches = stem_matches.intersection(all_stems[other]) if len(stem_matches) >= MINMATCH: for m in stem_matches: for e in all_endings[m]: print(m+e) </code></pre>

It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices: <pre class="prettyprint"><code>all_stems = { 'chair' : set(['high', 'arm']), 'charge': set(['over', 'under']), 'fall' : set(['down', 'water', 'night']), 'up' : set(['lock', 'crack', 'vote']), 'down' : set(['lock', 'crack', 'fall']), } endings = all_stems.keys() stem_sets = all_stems.values() i = 0 for target_stem_set in stem_sets: i += 1 j = 0 remaining_stems = stem_sets[i:] for remaining_stem_set in remaining_stems: j += 1 union = target_stem_set & remaining_stem_set if len(union) > 1: print "%d matches found" % len(union) for stem in union: print "%s%s" % (stem, endings[i-1]) print "%s%s" % (stem, endings[j+i-1]) </code></pre> Output: <pre class="prettyprint"><code>$ python stems_and_endings.py 2 matches found lockdown lockup crackdown crackup </code></pre> Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if there are more than two matches. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.

Finding combinations of stems and endings

Tags:

python

I have mappings of "stems" and "endings" (may not be the correct words) that look like so:

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

But much longer, of course. I also made the corresponding dictionary the other way around:

all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}

I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in

lockdown
lockup
crackdown
crackup

But not including 'upvote', 'downfall' or 'locksmith' (and it's this that causes me the biggest problems). I get false positives like:

pancake
cupcake
cupboard

But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction.

Confused and useless code so far, which you probably should just ignore:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

897

asked Jan 20 '11 16:01

Lennart Regebro

1 Answers

It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

Output:

$ python stems_and_endings.py 
2 matches found
lockdown
lockup
crackdown
crackup

Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if there are more than two matches. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.

191

answered Sep 23 '22 12:09

ire_and_curses

Related questions
                            
                                escape problem in django templates
                            
                                How do you reload your Python source into the console window in Eclipse/Pydev?
                            
                                Python memory footprint vs. heap size
                            
                                Django many-to-many relations, and through
                            
                                Weird python behaviour on machine with ARM CPU
                            
                                IronPython libraries for scientific plots [closed]
                            
                                How to make python window run as "Always On Top"?
                            
                                Python script to print all function definitions of a C/C++ file
                            
                                How to return the count of related entities in sqlalchemy query
                            
                                Determine cluster size of file system in Python
                            
                                Bounced email on Google App Engine
                            
                                is there anything exist to convert xml -> yaml directly?
                            
                                Dragon NaturallySpeaking Programmers
                            
                                Scipy interpolation on a numpy array
                            
                                How to create logic bricks programmatically in Blender?
                            
                                Are there an GUI libraries for Python that allow you to compile to an EXE (windows) and APP(Mac)?
                            
                                Embedding Python on Windows: why does it have to be a DLL?
                            
                                Dynamic Programming Algorithm for Segmented Least Squares
                            
                                Interactive vector graphics canvas for python?
                            
                                How to decrease hatch density in matplotlib

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With