
Splitting text into sentences and sentences into words: BreakIterator vs regular expressions

I happened to answer a question where the original problem involved splitting a sentence into separate words.

The author suggested using BreakIterator to tokenize the input strings, and some people liked the idea.

I just don't get that madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?

Please explain to me the pros of using BreakIterator and the real cases where it should be used.

If it's really so cool and proper, then I wonder: do you actually use the BreakIterator approach in your projects?

Roman asked Dec 19 '10 10:12



1 Answer

Judging by the code posted in that answer, BreakIterator takes the language and locale of the text into account. Getting that level of support from a regex would be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
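
To make the comparison concrete, here is a minimal sketch (not the code from the linked answer, which is not reproduced here) that splits the same sentence both ways. The class name WordSplitDemo, both helper method names, and the sample text are made up for illustration; only java.text.BreakIterator and String.split are standard API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names here are illustrative, not from the original answer.
class WordSplitDemo {

    // Locale-aware splitting: BreakIterator decides where word boundaries fall
    // according to the rules for the given locale.
    static List<String> wordsWithBreakIterator(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // The iterator also yields whitespace and punctuation as segments,
            // so keep only segments that contain at least one letter or digit.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words;
    }

    // The regex one-liner: split on runs of whitespace.
    // Punctuation stays attached to the neighbouring word.
    static String[] wordsWithRegex(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Hello, world! How are you?";
        System.out.println(wordsWithBreakIterator(text, Locale.ENGLISH));
        System.out.println(String.join("|", wordsWithRegex(text)));
    }
}
```

For this input the regex version keeps punctuation glued to the words ("Hello,", "world!", "you?"), while the BreakIterator version returns the bare words. The regex can of course be tuned for English punctuation; the harder part is text in languages that do not separate words with spaces, where a whitespace-based pattern has nothing to split on.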

MAK answered Oct 03 '22 05:10
