I want to split a multilingual string into single-language tokens using a regex.
For example, for this English-Arabic string:
'his name was محمد, and his mother name was آمنه.'
The result must be as below:
It's not perfect (you definitely need to try it on some real-world examples to see if it fits), but it's a start:
splitArray = Regex.Split(subjectString,
@"(?<=\p{IsArabic}) # (if the previous character is Arabic)
[\p{Zs}\p{P}]+ # split on whitespace/punctuation
(?=\p{IsBasicLatin}) # (if the following character is Latin)
| # or
(?<=\p{IsBasicLatin}) # vice versa
[\s\p{P}]+
(?=\p{IsArabic})",
RegexOptions.IgnorePatternWhitespace);
This splits on whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
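To try this on the sentence from the question, here is a minimal sketch of a complete console program. The class name `SplitDemo` and the bracketed output format are just illustrative choices, and the pattern is the same one as above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class SplitDemo
{
    static void Main()
    {
        // Make sure the console can print the Arabic characters.
        Console.OutputEncoding = Encoding.UTF8;

        // Sample sentence from the question.
        string subjectString = "his name was محمد, and his mother name was آمنه.";

        string[] splitArray = Regex.Split(subjectString,
            @"(?<=\p{IsArabic})      # (if the previous character is Arabic)
              [\p{Zs}\p{P}]+         # split on whitespace/punctuation
              (?=\p{IsBasicLatin})   # (if the following character is Latin)
              |                      # or
              (?<=\p{IsBasicLatin})  # vice versa
              [\s\p{P}]+
              (?=\p{IsArabic})",
            RegexOptions.IgnorePatternWhitespace);

        // Print each token in brackets so leading/trailing characters are visible.
        foreach (string token in splitArray)
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

For this sentence the split should produce four tokens: "his name was", "محمد", "and his mother name was" and "آمنه.". Note that the final period stays attached to the last Arabic token, because nothing follows it and so the lookahead cannot match there. That is one of the cases you may want to handle separately.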