
Split a multi-lingual string into uni-lingual tokens using Regex

Tags: c#, regex

I want to split a multi-lingual string into uni-lingual tokens using Regex.

For example, for this English-Arabic string:

'his name was محمد, and his mother name was آمنه.'

The result must be as below:

  1. 'his name was '
  2. 'محمد,'
  3. ' and his mother name was '
  4. 'آمنه.'
ARZ asked Apr 16 '12 05:04



1 Answer

It's not perfect (you definitely need to try it on some real-world examples to see if it fits), but it's a start:

using System.Text.RegularExpressions;

// subjectString holds the mixed English/Arabic input.
string[] splitArray = Regex.Split(subjectString,
    @"(?<=\p{IsArabic})    # (if the previous character is Arabic)
    [\s\p{P}]+             # split on whitespace/punctuation
    (?=\p{IsBasicLatin})   # (if the following character is Latin)
    |                      # or
    (?<=\p{IsBasicLatin})  # vice versa
    [\s\p{P}]+             # same separator class as above
    (?=\p{IsArabic})",
    RegexOptions.IgnorePatternWhitespace);

This splits on whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
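One thing to keep in mind: Regex.Split throws away the text it matches, so the spaces and punctuation shown in the expected tokens in the question are not part of the results. If you need them kept, matching runs instead of splitting on separators may be closer to what you want. Here is a minimal sketch of that idea. The ScriptTokenizer class and its Tokenize method are hypothetical names made up for illustration, and it assumes that "Arabic" means the \p{IsArabic} block (U+0600 to U+06FF) and that everything else belongs to the non-Arabic run:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ScriptTokenizer
{
    // Alternative 1: an Arabic character, optionally followed by more Arabic
    // characters (whitespace/punctuation allowed in between), then any
    // punctuation directly after the run.
    // Alternative 2: the longest stretch that contains no Arabic character.
    private static readonly Regex RunPattern = new Regex(
        @"\p{IsArabic}(?:[\s\p{P}]*\p{IsArabic})*\p{P}*   # Arabic run + trailing punctuation
          |                                               # or
          [^\p{IsArabic}]+                                # everything up to the next Arabic character",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the matched runs in order; nothing from the input is dropped.
    public static string[] Tokenize(string text)
    {
        return RunPattern.Matches(text)
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToArray();
    }
}
```

Calling ScriptTokenizer.Tokenize on the string from the question should give the four tokens listed there, with the comma and the full stop attached to the Arabic words and the surrounding spaces left in the English parts. As with the Split version, try it on real-world text before relying on it, since digits, mixed punctuation and other scripts all end up in the non-Arabic runs.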

Tim Pietzcker answered Sep 19 '22 15:09
