
Regex for Matching Pinyin

Tags: regex, cjk

I'm looking for a regular expression that correctly matches valid pinyin (e.g. "sheng", "sou") while rejecting invalid pinyin (e.g. "shong", "sei"). Most of the regexes in the top Google results match invalid pinyin in some cases.

Obviously, no matter what approach one takes, this will be a monster regex, and I'm especially interested in the different approaches one could take to solve this problem. For example, "Optimizing a regular expression to parse chinese pinyin" uses lookbacks.

A table of valid pinyin can be found here: http://pinyin.info/rules/initials_finals.html

asked Dec 23 '13 02:12 by stevendaniels


2 Answers

I went for a regex that grouped smaller regexes by the pinyin's initial (usually the first letter). So, the first group includes all "b", "p" and "m" sounds, then "f", then "d" and "t", etc.

This approach seems easy to read and should be easy to edit (if it needs corrections or additions). I also added exceptions to the beginning of groups in order to improve readability.

([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
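
As posted, the pattern has no anchors, so a plain search would also find valid syllables inside longer invalid strings. A minimal sketch of one way to try it in Python, assuming whole-syllable matching is what's wanted: the pattern is copied verbatim from above, `re.VERBOSE` is used only so the line breaks are ignored, and the helper name `is_pinyin` is made up for this example.

```python
import re

# Pattern copied verbatim from the answer; re.VERBOSE makes the regex
# engine ignore the newlines between the per-initial groups.
PINYIN = re.compile(r"""
([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
""", re.VERBOSE)


def is_pinyin(syllable):
    """Hypothetical helper: True only if the whole string is one match."""
    return PINYIN.fullmatch(syllable) is not None


# The four examples from the question.
for word in ["sheng", "sou", "shong", "sei"]:
    print(word, is_pinyin(word))
```

With `fullmatch`, the two valid examples from the question are accepted and the two invalid ones are rejected. Other syllables would still need checking against the table linked in the question.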

Here's the Debuggex example I created.


answered Nov 07 '22 17:11 by stevendaniels


I would use a combination approach that is not solely regex.

Check for valid pinyin:

  1. grab word

  2. grab letters from the beginning of the word as long as they are consonants. This separates the initial sound from the final sound.

  3. check that the initial and final are valid...

  4. ...and if so, see if their combination is allowed (via a table like this, but the entries are simply 1's and 0's).
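
The steps above could be sketched roughly like this in Python. Everything named here is an assumption for illustration: `ALLOWED` is only a tiny hand-copied subset of the initial/final table (three initials), a dict of sets stands in for the 1/0 grid, and "v" is treated as a vowel standing in for "ü".

```python
# Letters that end the initial; "v" is assumed to stand in for "ü".
VOWELS = set("aeiouüv")

# Illustrative subset only: initial -> finals allowed with it.
# A real checker would fill in every initial from the full table.
ALLOWED = {
    "": {"a", "ai", "an", "ang", "ao", "e", "ei", "en", "eng", "er",
         "o", "ou"},
    "s": {"a", "ai", "an", "ang", "ao", "e", "en", "eng", "i", "ong",
          "ou", "u", "uan", "ui", "un", "uo"},
    "sh": {"a", "ai", "an", "ang", "ao", "e", "ei", "en", "eng", "i",
           "ou", "u", "ua", "uai", "uan", "uang", "ui", "un", "uo"},
}


def split_syllable(word):
    """Step 2: take leading consonants as the initial, the rest as the final."""
    word = word.lower()
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i], word[i:]


def is_valid_pinyin(word):
    """Steps 3 and 4: the initial must be known and the pair must be allowed."""
    initial, final = split_syllable(word)
    return final in ALLOWED.get(initial, set())


for word in ["sheng", "sou", "shong", "sei"]:
    print(word, split_syllable(word), is_valid_pinyin(word))
```

An unknown initial or an unlisted final both fall out as `False`, so the separate validity check in step 3 collapses into the table lookup in this sketch.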

answered Nov 07 '22 18:11 by mareoraft