I have a dict of words (actually I have nested dicts of verb conjugations, but that isn't relevant) and I want to make a regex by combining them.
{
'yo': 'hablaba',
'tú': 'hablabas',
'él': 'hablaba',
'nosotros': 'hablábamos',
'vosotros': 'hablabais',
'ellos': 'hablaban',
'vos': 'hablabas',
}
... to make:
'habl((aba(s|is|n)?)|ábamos)' # I think that's right
If I don't include 'hablábamos' it's easy: they all share the same prefix, and I can get:
'hablaba(s|is|n)?'
... but I want a general form. Is that possible?
Yes, I believe this is possible.
To get you started, this is how I would break down the problem.
Calculate the root by finding the longest string that every conjugated form starts with (hablar here is the dict from the question, and the sessions below are Python 2):
>>> root = ''
>>> for c in hablar['yo']:
...     if all(v.startswith(root + c) for v in hablar.itervalues()):
...         root += c
...     else:
...         break
...
>>> root
'habl'
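As an aside, the standard library can find the root for you. This is only a sketch, written for Python 3 rather than the Python 2 used in the sessions here; the helper name common_root is my own, and hablar is retyped from the question. It relies on os.path.commonprefix, which compares strings character by character despite living in os.path:

```python
import os.path

# The dict from the question, retyped here so the snippet runs on its own.
hablar = {
    'yo': 'hablaba',
    'tú': 'hablabas',
    'él': 'hablaba',
    'nosotros': 'hablábamos',
    'vosotros': 'hablabais',
    'ellos': 'hablaban',
    'vos': 'hablabas',
}


def common_root(words):
    """Return the longest prefix shared by every word (hypothetical helper)."""
    # commonprefix works on plain strings, one character at a time.
    return os.path.commonprefix(list(words))


root = common_root(hablar.values())
print(root)  # habl
```

The root stops at 'habl' because 'hablábamos' has an accented 'á' where the other forms have a plain 'a'.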
Whatever's left of the words makes a list of endings.
>>> endings = [v[len(root):] for v in hablar.itervalues()]
>>> print endings
['abas', 'aba', 'abais', 'aba', '\xc3\xa1bamos', 'aban', 'abas']
You may then want to weed out the duplicates:
>>> unique_endings = set(endings)
>>> print unique_endings
set(['abas', 'abais', '\xc3\xa1bamos', 'aban', 'aba'])
Then join these endings together with pipes:
>>> conjoined_endings = '|'.join(unique_endings)
>>> print conjoined_endings
abas|abais|ábamos|aban|aba
Forming the regular expression is then a simple matter of combining the root with the conjoined_endings string in parentheses:
>>> final_regex = '{}({})'.format(root, conjoined_endings)
>>> print final_regex
habl(abas|abais|ábamos|aban|aba)
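Putting the steps together, here is a rough sketch of a reusable function, written for Python 3. The name build_regex and the choice to sort the endings are my own assumptions, not something the question asked for. A set has no defined order, and regex alternation tries alternatives left to right, so I put the longest endings first; otherwise 'aba' could win over 'abas' when the pattern is used with re.match or re.search. The re.escape calls are there in case a word ever contains a regex metacharacter.

```python
import os.path
import re

# The dict from the question, retyped here so the snippet runs on its own.
hablar = {
    'yo': 'hablaba',
    'tú': 'hablabas',
    'él': 'hablaba',
    'nosotros': 'hablábamos',
    'vosotros': 'hablabais',
    'ellos': 'hablaban',
    'vos': 'hablabas',
}


def build_regex(words):
    """Combine words into a pattern of the form root(ending|ending|...)."""
    unique = set(words)  # drop duplicates such as the two 'hablaba' entries
    root = os.path.commonprefix(list(unique))
    # Longest endings first (ties broken alphabetically) so that a short
    # alternative cannot shadow a longer one that starts the same way.
    endings = sorted((w[len(root):] for w in unique),
                     key=lambda e: (-len(e), e))
    alternatives = '|'.join(re.escape(e) for e in endings)
    return '{}({})'.format(re.escape(root), alternatives)


pattern = build_regex(hablar.values())
regex = re.compile(pattern)

# Every conjugated form should match in full; the infinitive should not.
print(all(regex.fullmatch(w) for w in hablar.values()))  # True
print(regex.fullmatch('hablar'))                         # None
```

This still produces the flat root(a|b|c) form rather than the nested pattern from the question, but both accept the same set of words.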