
How to capture Hebrew with regex in Java?

Tags: java, regex, hebrew

I'm trying to capture a section of Hebrew text (taken from comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments but some comments are missed.

I've tried to debug this and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it seems to be correct, but the regex still doesn't match it.

Any ideas?

asked Jan 24 '12 12:01 by lribinik


1 Answer

It would be more semantically correct to use \p{InHebrew} instead of \u0590-\u05FF.

You also need to match punctuation, digits (at least the commonly used ones) and the different kinds of spaces. I don't know what \p{Graph} covers, or whether there are any Hebrew-specific punctuation symbols, but it seems your pattern misses some of these.
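A minimal sketch of that idea, not a definitive fix: the character set below (Hebrew block plus Unicode punctuation, digits and separators) is an assumption about what the comments contain, and the class name, helper methods and sample string are made up for illustration. In Java, POSIX classes such as \p{Graph} and the shorthand \s are ASCII-only unless the UNICODE_CHARACTER_CLASS flag is set, so a character like a no-break space slips through the pattern from the question.

```java
import java.util.regex.Pattern;

final class HebrewRegexSketch {
    // The pattern from the question, kept for comparison.
    static final Pattern ORIGINAL =
            Pattern.compile("[\u0590-\u05FF \\p{Graph} \\s]+");

    // Assumed replacement: Hebrew block, Unicode punctuation (\p{P}),
    // Unicode numbers (\p{N}), Unicode separators (\p{Z}) and ASCII whitespace (\s).
    // Unlike \p{Graph}, this does not accept Latin letters or symbols such as '$'.
    static final Pattern HEBREW_TEXT =
            Pattern.compile("[\\p{InHebrew}\\p{P}\\p{N}\\p{Z}\\s]+");

    // True when the whole text consists of characters the pattern accepts.
    static boolean matchesFully(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    // Debugging helper: lists the code points that the pattern rejects.
    static String unmatchedCodePoints(Pattern p, String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!p.matcher(ch).matches()) {
                sb.append(String.format("U+%04X ", cp));
            }
        });
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Two Hebrew words joined by a no-break space (U+00A0), then '!'.
        String sample = "\u05E9\u05DC\u05D5\u05DD\u00A0\u05E2\u05D5\u05DC\u05DD!";
        System.out.println(matchesFully(ORIGINAL, sample));        // false
        System.out.println(matchesFully(HEBREW_TEXT, sample));     // true
        System.out.println(unmatchedCodePoints(ORIGINAL, sample)); // U+00A0
    }
}
```

Running the helper on a comment that fails to match should show which code points fall outside the pattern, which may well turn out to be a space or punctuation mark rather than a Hebrew letter. If Latin letters or symbols also need to match, the set would have to be widened accordingly.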

answered Oct 03 '22 04:10 by kirilloid