Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing a regular expression to parse chinese pinyin [closed]

I have a working regular expression to parse PinYin which matches every valid PinYin and does not match invalid ones. I am wondering how I can optimize it.

^(?P<initial>ch|zh|sh|r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z|)
 (?P<final>
   (?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)
  |(?:(?<=zh)uang|(?<=zh)ang|(?<=zh)eng|(?<=zh)ong|(?<=zh)uai|(?<=zh)uan|(?<=zh)ai|(?<=zh)an|(?<=zh)ao|(?<=zh)ei|(?<=zh)en|(?<=zh)ou|(?<=zh)ua|(?<=zh)ui|(?<=zh)un|(?<=zh)uo|(?<=zh)a|(?<=zh)e|(?<=zh)i|(?<=zh)u)
  |(?:(?<=sh)uang|(?<=sh)ang|(?<=sh)eng|(?<=sh)uai|(?<=sh)uan|(?<=sh)ai|(?<=sh)an|(?<=sh)ao|(?<=sh)ei|(?<=sh)en|(?<=sh)ou|(?<=sh)ua|(?<=sh)ui|(?<=sh)un|(?<=sh)uo|(?<=sh)a|(?<=sh)e|(?<=sh)i|(?<=sh)u)
  |(?:(?<=c)ang|(?<=c)eng|(?<=c)ong|(?<=c)uan|(?<=c)ai|(?<=c)an|(?<=c)ao|(?<=c)en|(?<=c)ou|(?<=c)ui|(?<=c)un|(?<=c)uo|(?<=c)a|(?<=c)e|(?<=c)i|(?<=c)u)
  |(?:(?<=b)ang|(?<=b)eng|(?<=b)ian|(?<=b)iao|(?<=b)ing|(?<=b)ai|(?<=b)an|(?<=b)ao|(?<=b)ei|(?<=b)en|(?<=b)ie|(?<=b)in|(?<=b)a|(?<=b)i|(?<=b)o|(?<=b)u)
  |(?:(?<=d)ang|(?<=d)eng|(?<=d)ian|(?<=d)iao|(?<=d)ing|(?<=d)ong|(?<=d)uan|(?<=d)ai|(?<=d)an|(?<=d)ao|(?<=d)ei|(?<=d)en|(?<=d)ia|(?<=d)ie|(?<=d)iu|(?<=d)ou|(?<=d)ui|(?<=d)un|(?<=d)uo|(?<=d)a|(?<=d)e|(?<=d)i|(?<=d)u)
  |(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ai
  |(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)an|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ang
  |(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ao|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)e

Above is an abbreviated version for sake of readability. The whole expression can be found at the end of this post.

I am specifically wondering if passing in two or more prefixes to an ending matcher would improve perfomance:

 (<=ch|zh|sh)uang|(<=ch|zh|sh)ang...

Thanks for your time and suggestions.

whole regex:

 ^(?P<initial>ch|zh|sh|r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z|)(?P<final>(?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)|(?:(?<=zh)uang|(?<=zh)ang|(?<=zh)eng|(?<=zh)ong|(?<=zh)uai|(?<=zh)uan|(?<=zh)ai|(?<=zh)an|(?<=zh)ao|(?<=zh)ei|(?<=zh)en|(?<=zh)ou|(?<=zh)ua|(?<=zh)ui|(?<=zh)un|(?<=zh)uo|(?<=zh)a|(?<=zh)e|(?<=zh)i|(?<=zh)u)|(?:(?<=sh)uang|(?<=sh)ang|(?<=sh)eng|(?<=sh)uai|(?<=sh)uan|(?<=sh)ai|(?<=sh)an|(?<=sh)ao|(?<=sh)ei|(?<=sh)en|(?<=sh)ou|(?<=sh)ua|(?<=sh)ui|(?<=sh)un|(?<=sh)uo|(?<=sh)a|(?<=sh)e|(?<=sh)i|(?<=sh)u)|(?:(?<=c)ang|(?<=c)eng|(?<=c)ong|(?<=c)uan|(?<=c)ai|(?<=c)an|(?<=c)ao|(?<=c)en|(?<=c)ou|(?<=c)ui|(?<=c)un|(?<=c)uo|(?<=c)a|(?<=c)e|(?<=c)i|(?<=c)u)|(?:(?<=b)ang|(?<=b)eng|(?<=b)ian|(?<=b)iao|(?<=b)ing|(?<=b)ai|(?<=b)an|(?<=b)ao|(?<=b)ei|(?<=b)en|(?<=b)ie|(?<=b)in|(?<=b)a|(?<=b)i|(?<=b)o|(?<=b)u)|(?:(?<=d)ang|(?<=d)eng|(?<=d)ian|(?<=d)iao|(?<=d)ing|(?<=d)ong|(?<=d)uan|(?<=d)ai|(?<=d)an|(?<=d)ao|(?<=d)ei|(?<=d)en|(?<=d)ia|(?<=d)ie|(?<=d)iu|(?<=d)ou|(?<=d)ui|(?<=d)un|(?<=d)uo|(?<=d)a|(?<=d)e|(?<=d)i|(?<=d)u)|(?:(?<=g)uang|(?<=g)ang|(?<=g)eng|(?<=g)ong|(?<=g)uai|(?<=g)uan|(?<=g)ai|(?<=g)an|(?<=g)ao|(?<=g)ei|(?<=g)en|(?<=g)ou|(?<=g)ua|(?<=g)ui|(?<=g)un|(?<=g)uo|(?<=g)a|(?<=g)e|(?<=g)u)|(?:(?<=f)ang|(?<=f)eng|(?<=f)iao|(?<=f)an|(?<=f)ei|(?<=f)en|(?<=f)ou|(?<=f)a|(?<=f)o|(?<=f)u)|(?:(?<!sh|ch|zh)(?<=h)uang|(?<!sh|ch|zh)(?<=h)ang|(?<!sh|ch|zh)(?<=h)eng|(?<!sh|ch|zh)(?<=h)ong|(?<!sh|ch|zh)(?<=h)uai|(?<!sh|ch|zh)(?<=h)uan|(?<!sh|ch|zh)(?<=h)ai|(?<!sh|ch|zh)(?<=h)an|(?<!sh|ch|zh)(?<=h)ao|(?<!sh|ch|zh)(?<=h)ei|(?<!sh|ch|zh)(?<=h)en|(?<!sh|ch|zh)(?<=h)ou|(?<!sh|ch|zh)(?<=h)ua|(?<!sh|ch|zh)(?<=h)ui|(?<!sh|ch|zh)(?<=h)un|(?<!sh|ch|zh)(?<=h)uo|(?<!sh|ch|zh)(?<=h)a|(?<!sh|ch|zh)(?<=h)e|(?<!sh|ch|zh)(?<=h)u)|(?:(?<=k)uang|(?<=k)ang|(?<=k)eng|(?<=k)ong|(?<=k)uai|(?<=k)uan|(?<=k)ai|(?<=k)an|(?<=k)ao|(?<=k)en|(?<=k)ou|(?<=k)ua|(?<=k)ui|(?<=k)un|(?<=k)uo|(?<=k)a|(?<=k)e|(?<=k)u)|(?:(?<=j)iang|(?<=j)iong|(?<=j)ian|(?<=j)iao|(?<=j)ing|(?<=j)üan|(?<=j)ia|(?<=j)ie|(?<=j)in|(?<=j)iu|(?<=j)üe|(?<=j)ün|(?<=j)i|(?<=j)ü)|(?:(?<=m)ang|(?<=m)eng|(?<=m)ian|(?<=m)iao|(?<=m)ing|(?<=m)ai|(?<=m)an|(?<=m)ao|(?<=m)ei|(?<=m)en|(?<=m)ie|(?<=m)in|(?<=m)iu|(?<=m)ou|(?<=m)a|(?<=m)e|(?<=m)i|(?<=m)o|(?<=m)u)|(?:(?<=l)iang|(?<=l)ang|(?<=l)eng|(?<=l)ian|(?<=l)iao|(?<=l)ing|(?<=l)ong|(?<=l)uan|(?<=l)ai|(?<=l)an|(?<=l)ao|(?<=l)ei|(?<=l)ia|(?<=l)ie|(?<=l)in|(?<=l)iu|(?<=l)ou|(?<=l)un|(?<=l)uo|(?<=l)üe|(?<=l)a|(?<=l)e|(?<=l)i|(?<=l)o|(?<=l)u|(?<=l)ü)|(?:(?<=n)iang|(?<=n)ang|(?<=n)eng|(?<=n)ian|(?<=n)iao|(?<=n)ing|(?<=n)ong|(?<=n)uan|(?<=n)ai|(?<=n)an|(?<=n)ao|(?<=n)ei|(?<=n)en|(?<=n)ie|(?<=n)in|(?<=n)iu|(?<=n)ou|(?<=n)un|(?<=n)uo|(?<=n)üe|(?<=n)a|(?<=n)e|(?<=n)i|(?<=n)u|(?<=n)ü)|(?:(?<=q)iang|(?<=q)iong|(?<=q)ian|(?<=q)iao|(?<=q)ing|(?<=q)üan|(?<=q)ia|(?<=q)ie|(?<=q)in|(?<=q)iu|(?<=q)üe|(?<=q)ün|(?<=q)i|(?<=q)ü)|(?:(?<=p)ang|(?<=p)eng|(?<=p)ian|(?<=p)iao|(?<=p)ing|(?<=p)ai|(?<=p)an|(?<=p)ao|(?<=p)ei|(?<=p)en|(?<=p)ie|(?<=p)in|(?<=p)ou|(?<=p)a|(?<=p)i|(?<=p)o|(?<=p)u)|(?:(?<=s)ang|(?<=s)eng|(?<=s)ong|(?<=s)uan|(?<=s)ai|(?<=s)an|(?<=s)ao|(?<=s)en|(?<=s)ou|(?<=s)ui|(?<=s)un|(?<=s)uo|(?<=s)a|(?<=s)e|(?<=s)i|(?<=s)u)|(?:(?<=r)ang|(?<=r)eng|(?<=r)ong|(?<=r)uan|(?<=r)an|(?<=r)ao|(?<=r)en|(?<=r)ou|(?<=r)ua|(?<=r)ui|(?<=r)un|(?<=r)uo|(?<=r)e|(?<=r)i|(?<=r)u)|(?:(?<=t)ang|(?<=t)eng|(?<=t)ian|(?<=t)iao|(?<=t)ing|(?<=t)ong|(?<=t)uan|(?<=t)ai|(?<=t)an|(?<=t)ao|(?<=t)ei|(?<=t)ie|(?<=t)ou|(?<=t)ui|(?<=t)un|(?<=t)uo|(?<=t)a|(?<=t)e|(?<=t)i|(?<=t)u)|(?:(?<=w)ang|(?<=w)eng|(?<=w)ai|(?<=w)an|(?<=w)ei|(?<=w)en|(?<=w)a|(?<=w)o|(?<=w)u)|(?:(?<=y)ang|(?<=y)ing|(?<=y)ong|(?<=y)uan|(?<=y)ai|(?<=y)an|(?<=y)ao|(?<=y)in|(?<=y)ou|(?<=y)ue|(?<=y)un|(?<=y)a|(?<=y)e|(?<=y)e|(?<=y)i|(?<=y)o|(?<=y)u)|(?:(?<=x)iang|(?<=x)iong|(?<=x)ian|(?<=x)iao|(?<=x)ing|(?<=x)üan|(?<=x)ia|(?<=x)ie|(?<=x)in|(?<=x)iu|(?<=x)üe|(?<=x)ün|(?<=x)i|(?<=x)ü)|(?:(?<=z)ang|(?<=z)eng|(?<=z)ong|(?<=z)uan|(?<=z)ai|(?<=z)an|(?<=z)ao|(?<=z)ei|(?<=z)en|(?<=z)ou|(?<=z)ui|(?<=z)un|(?<=z)uo|(?<=z)a|(?<=z)e|(?<=z)i|(?<=z)u)|(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ai|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)an|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ang|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ao|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)e|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ei|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)en|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)eng|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)er|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)o|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ou))$
like image 820
tike Avatar asked Jun 17 '13 21:06

tike


1 Answers

Assuming that your regex engine supports lookbehinds, atomic groups and possessive quantifiers (that are PCRE features):

Some examples of what can be replaced:

  • all (?: by (?>

  • the begining (all the first named group) by:

    ^(?P<initial>(?>[csz]h?+|[bdfghj-npqrtwxy])?)

  • this part* by:

    |(?<![csz]h)(?<=h)(?>a(?>[io]|ng?+)?|e(?>i|ng?+)?|o(?>u|ng)|u(?>[ino]|a(?>i|ng?+)?)?)

*( ie: |(?:(?<!sh|ch|zh)(?<=h)uang|(?<!sh|ch|...|(?<!sh|ch|zh)(?<=h)u) )

  • the last part* by:

    |(?<![bcdfghj-np-tw-z])(?>a(?>[io]|ng?+)?|e(?>[ir]|ng?+)?|ou?+))$

*( ie:|(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|...))$ )

How to deal with the other parts:

example:

(?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)

_ all this kind of parts has the same lookbehind, you must do these steps for each _

# step 1: lookarounds factorization

(?<=ch)(?>ang|eng|ong|uai|uan|ai|an|ao|en|ou|ua|ui|un|uo|a|e|i|u)

# step 2: sort all the content by alphabetic order 

(?<=ch)(?>a|ai|an|ang|ao|e|en|eng|i|ong|ou|u|ua|uai|uan|ui|un|uo)

# step 3: group by first letter: don't forget the ? if the letter can be alone

(?<=ch)(?>a(?>i|n|ng|o)?|e(?>n|ng)?|i|o(?>ng|u)|u(?>a|ai|an|i|n|o)?)

# step 4: reduce the terminations (ie: n & ng => ng?+) 

(?<=ch)(?>a(?>i|ng?+|o)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a[in]?+|i|n|o)?)

# step 5: put single letters in a character class

(?<=ch)(?>a(?>[io]|ng?+)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a[in]?+|[ino])?)

conclusion

Although the result is shorter, the goal here is optimization. I reduced the number of tests with the factorization and the number of backtracks using atomic groups and possessive quantifiers.

some limitations

Note that regex features like atomic groups and possessive quantifiers are not supported by all regex flavors, but it is possible to remedy the problem:

  • for flavors that don't support possessive quantifiers: change ?+ to ?
  • for flavors that don't support atomic groups: change (?> to (?:

(Note that there is a trick to have atomic groups with Python, which you may test with a timer, to surround all the pattern. See this incredible post: Do Python regular expressions have an equivalent to Ruby's atomic grouping?)

Some regex engines such as javascript do not support lookbehinds. In this case, you must rewrite all your pattern using only alternations (ie |), which isn't a bad thing, since lookbehinds make your pattern slower; and give up the named captures that are not supported too. (In this context, it should be noted that to remove negative lookbehinds you need to put syllables described in these parts before all others so that they are matched first.)

other ways of optimization

  • rewrite your pattern without lookbehinds and with | instead
  • sort the different lines by the most used syllables
like image 60
Casimir et Hippolyte Avatar answered Oct 10 '22 13:10

Casimir et Hippolyte