Suppose I have the following strings:
var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
For English I use the following regex, but how can I write a regex that supports Persian, or a mix of both?
var matches = englishSentence.match(/\b(\w)/g);
var acronym = matches.join('');
There is no way to match a Unicode word boundary: \b is not Unicode aware, even in ECMAScript 2018.
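To see the limitation in practice, here is a minimal sketch (not part of the original answer) that runs the question's pattern against the Persian string. The variable names asciiOnly and withUnicodeFlag are only illustrative.

```javascript
// \w and \b only consider ASCII word characters [A-Za-z0-9_],
// so Persian letters never count as "word" characters.
var persianSentence = 'گروه جوانان خلاق';

var asciiOnly = persianSentence.match(/\b(\w)/g);
console.log(asciiOnly); // null: nothing matched

// The u flag on its own does not change what \w or \b mean.
var withUnicodeFlag = persianSentence.match(/\b(\w)/gu);
console.log(withUnicodeFlag); // null
```

Because match() returns null here, calling .join('') on the result would throw, which is why the question's code fails for Persian input.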
For ECMAScript 2018 compatible browsers (e.g. the latest versions of Chrome as of April 2018), you may use:
var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
var reg = /(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu;
console.log(englishSentence.match(reg));
console.log(persianSentence.match(reg));
Details
- (?<!\p{L}\p{M}*) - a negative lookbehind that fails the match if there is a Unicode letter followed with 0+ diacritics immediately before the current position
- \p{L}\p{M}* - a Unicode letter followed with 0+ diacritics
- gu - the flags: g - global, search for all matches; u - make the pattern Unicode aware.

If you need the same functionality in older/other browsers, use XRegExp:
function getFirstLetters(s, regex) {
  var results = [];
  // Collect capture group 1 (the first letter plus its diacritics) from each match.
  XRegExp.forEach(s, regex, function (match) {
    results.push(match[1]);
  });
  return results;
}
var rx = XRegExp("(?:^|[^\\pL\\pM])(\\pL\\pM*)", "gu");
console.log(getFirstLetters("Hello world", rx));
console.log(getFirstLetters('گروه جوانان خلاق', rx));
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.js"></script>
Details
- (?:^|[^\\pL\\pM]) - a non-capturing group that matches the start of the string (^) or any char other than a Unicode letter or diacritic
- (\\pL\\pM*) - Group 1: any Unicode letter followed with 0+ diacritics.

Here, we need to extract the Group 1 value, hence .push(match[1]) upon each match.
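To get back to the acronym from the question, the matched first letters still need to be joined. The following is a minimal sketch under the assumption that the engine supports ES2018 lookbehind and \p{...} property escapes; getAcronym is a hypothetical helper name, not something from the original answer or from XRegExp.

```javascript
// Hypothetical helper: builds an acronym from the first letter of every
// word, regardless of script. Assumes ES2018 lookbehind and \p{...} support.
function getAcronym(sentence) {
  var firstLetters = sentence.match(/(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu);
  // match() returns null when nothing matches, so fall back to an empty list.
  return (firstLetters || []).join('');
}

console.log(getAcronym('Hellow World'));      // "HW"
console.log(getAcronym('گروه جوانان خلاق')); // "گجخ"
```

The null fallback matters for input with no letters at all (for example an empty string), where the question's original matches.join('') would throw.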