
Unicode character regular expression, capture groups

I have a regular expression, \p{L}\p{M}*, which I use to split words into characters. This is particularly needed for Hindi or Thai words, where a single character can consist of several code points. For example, if मछली is split the regular way in Java I get [म][छ][ल][ी], whereas I want [म][छ][ली].

I have been trying to extend this regular expression to include space characters as well, so that splitting फार्म पशु would give the following groups: [फा][र्][म][ ][प][शु]

But I haven't had any luck. Would anyone be able to help me out?

Also, if anyone has an alternative way of doing this in Java, that would be an acceptable solution too. My current Java code is:

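import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One base letter followed by any number of combining marks
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(word);
while (matcher.find()) {
    characters.add(matcher.group());
}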
DianeH asked Mar 19 '23 17:03


1 Answer

Consider using the BreakIterator:

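import java.text.BreakIterator;
import java.util.Locale;

String text = "मछली";
Locale hindi = new Locale("hi", "IN");
// A character instance breaks on user-perceived characters rather than code points
BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
breaker.setText(text);
int start = breaker.first();
// Walk consecutive boundaries and print the text between each pair
for (int end = breaker.next();
     end != BreakIterator.DONE;
     start = end, end = breaker.next()) {
    System.out.println(text.substring(start, end));
}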

I tested the sample string using the Oracle Java 8 implementation. Also consider the ICU4J version of BreakIterator if required.
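
If staying with a regular expression is preferable, one possible sketch is to add whitespace as a second alternative to the pattern from the question. This is an assumption about what is wanted, not something tested alongside the BreakIterator code; the class name ClusterSplitter and the method split are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClusterSplitter {
    // Either a letter plus its trailing combining marks, or one whitespace character.
    // The "|\\s" alternative is the assumed extension to the pattern from the question.
    private static final Pattern CLUSTER = Pattern.compile("\\p{L}\\p{M}*|\\s");

    // Collects every match in order; text that matches neither alternative is skipped.
    static List<String> split(String text) {
        List<String> characters = new ArrayList<>();
        Matcher matcher = CLUSTER.matcher(text);
        while (matcher.find()) {
            characters.add(matcher.group());
        }
        return characters;
    }

    public static void main(String[] args) {
        // Print each group in brackets for the two-word sample from the question
        for (String s : split("फार्म पशु")) {
            System.out.print("[" + s + "]");
        }
        System.out.println();
    }
}
```

For फार्म पशु this gives the six groups asked for, with the space as its own group. Note that digits and punctuation match neither alternative and are silently dropped, and that \s without the UNICODE_CHARACTER_CLASS flag only covers ASCII whitespace, so the pattern would need widening for other input.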

McDowell answered Apr 02 '23 11:04