
RegEx for Ukrainian letters. How to separate Cyrillic words by capital letter?

I have a string with some Cyrillic words inside. Each starts with a capital letter.

var str = 'ХєлпМіПліз';

I have found this solution: str.match(/[А-Я][а-я]+/g).

But it returns ["Пл"] instead of ["Хєлп", "Мі", "Пліз"]. It seems that it doesn't recognize the Ukrainian letters ('і', 'є'), only the Russian ones.

So how do I have to change that regex to include Ukrainian letters?

Vlad Holubiev asked Nov 26 '13 18:11


3 Answers

[А-Я] is not the Cyrillic alphabet, it's just Russian!

Cyrillic is a writing system. It is used in the alphabets of many languages (like Latin, which serves Western European languages, Eastern European ones, etc.).

To cover both Russian and Ukrainian, use [А-ЯҐЄІЇ] (add Ё as well if you need it: it sits outside the А-Я range).

To add Belarusian: [А-ЯҐЄІЇЎ]
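
Applied to the string from the question, the combined class might be used as in the sketch below. The lowercase class [а-яґєії] is an assumption here: it simply mirrors the uppercase class given above.

```javascript
// Sketch: the question's pattern with the Ukrainian letters added to both classes.
// Assumption: the lowercase class mirrors the uppercase one from the answer.
const str = 'ХєлпМіПліз';

// One uppercase letter followed by one or more lowercase letters.
const words = str.match(/[А-ЯҐЄІЇ][а-яґєії]+/g);

console.log(words); // ["Хєлп", "Мі", "Пліз"]
```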

And for all Cyrillic characters (including those of Balkan languages and Old Cyrillic), you can use a Unicode character class such as \p{IsCyrillic}, in regex engines that support it.
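
The \p{IsCyrillic} spelling is not JavaScript syntax. In JavaScript engines that implement ES2018 Unicode property escapes, the closest equivalent is \p{Script=Cyrillic} together with the u flag. A minimal sketch under that assumption (the helper name isAllCyrillic is made up for illustration):

```javascript
// Assumes an engine with ES2018 Unicode property escapes (the u flag is required).

// Pull runs of Cyrillic characters out of mixed text.
const found = 'abc Хєлп def Пліз'.match(/\p{Script=Cyrillic}+/gu);
console.log(found); // ["Хєлп", "Пліз"]

// Hypothetical helper: true when every character belongs to the Cyrillic script.
const isAllCyrillic = (s) => /^\p{Script=Cyrillic}+$/u.test(s);
console.log(isAllCyrillic('ХєлпМіПліз')); // true
console.log(isAllCyrillic('Help'));       // false
```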


To deal with Ukrainian separately:

[А-ЩЬЮЯҐЄІЇ] or [А-ЩЬЮЯҐЄІЇа-щьюяґєії] seems to be the full Ukrainian alphabet, 33 letters in each case.
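
One way to sanity-check the letter count is to walk the Unicode Cyrillic block (U+0400 to U+04FF) and count the code points each class accepts. This is only a rough check, and the function name countMatches is made up for illustration:

```javascript
// Anchored versions of the two classes, so each test covers exactly one character.
const upper = /^[А-ЩЬЮЯҐЄІЇ]$/;
const both = /^[А-ЩЬЮЯҐЄІЇа-щьюяґєії]$/;

// Hypothetical helper: count code points in U+0400..U+04FF accepted by a regex.
function countMatches(re) {
  let n = 0;
  for (let cp = 0x0400; cp <= 0x04ff; cp++) {
    if (re.test(String.fromCodePoint(cp))) n++;
  }
  return n;
}

console.log(countMatches(upper)); // 33
console.log(countMatches(both));  // 66 (33 uppercase + 33 lowercase)
```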

The apostrophe is not a letter, but it is occasionally included in the alphabet because it affects the following vowel. The apostrophe is part of the word, not a separator. It may be written in a few ways:

U+0027 "'" APOSTROPHE
U+0060 "`" GRAVE ACCENT
U+2019 "’" RIGHT SINGLE QUOTATION MARK
U+02BC "ʼ" MODIFIER LETTER APOSTROPHE

and maybe some more.

Yes, the apostrophe is a bit complicated: there is no common standard for it.
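
A word pattern that tolerates the four apostrophe variants listed above could look like the sketch below. The variable names are made up, and requiring the apostrophe to sit between letters is an assumption of this sketch, not a rule stated in the answer.

```javascript
// Ukrainian letters, both cases (the class from the answer above).
const letter = '[А-ЩЬЮЯҐЄІЇа-щьюяґєії]';

// The four apostrophe variants: U+0027, U+0060, U+2019, U+02BC.
const apostrophe = "['`\\u2019\\u02BC]";

// Letters, optionally followed by (apostrophe + letters) groups.
// Assumption: an apostrophe only counts when it sits between letters.
const word = new RegExp(`${letter}+(?:${apostrophe}${letter}+)*`, 'g');

// The second and third words use U+2019 and U+02BC respectively.
const text = "з'їсти, п\u2019ять, об\u02BCєкт";
const matches = text.match(word);

console.log(matches.length); // 3, one match per word, apostrophes kept inside
```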

daubmannus answered Oct 23 '22 02:10


Use \p{Lu} to match an uppercase letter, \p{Ll} for a lowercase one, or \p{L} to match any letter.

Update: that worked in Java but not in JavaScript at the time of writing. Don't forget to include the apostrophe and "ї" ("ji") in your regexp.
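
JavaScript engines that implement ES2018 Unicode property escapes accept these classes when the regex carries the u flag. Under that assumption, a sketch for the string from the question:

```javascript
// Assumes ES2018 Unicode property escapes; without the u flag, \p{...} is not a class.
const str = 'ХєлпМіПліз';

// One uppercase letter (\p{Lu}) followed by one or more lowercase letters (\p{Ll}).
const words = str.match(/\p{Lu}\p{Ll}+/gu);
console.log(words); // ["Хєлп", "Мі", "Пліз"]

// These classes are not limited to Cyrillic, so Latin words are split the same way.
const mixed = 'HelloСвіт'.match(/\p{Lu}\p{Ll}+/gu);
console.log(mixed); // ["Hello", "Світ"]
```

Because \p{Lu} and \p{Ll} cover every script, this variant suits input where any capitalized word should be split, not only Cyrillic ones.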

Slava Medvediev answered Oct 23 '22 01:10


The Ukrainian alphabet has four letters that the Russian alphabet lacks: [і, є, ї, ґ]. A word can also contain an apostrophe (single quote).

"ґуля, з'їсти, істота, Європа".match(/[а-яієїґ\']+/ig)

The i flag at the end makes the match case-insensitive, so it also matches uppercase letters, as in "Європа".

Purkhalo Alex answered Oct 23 '22 02:10