
How to find non-alphabets using Java

I am processing a text corpus. It contains characters from several different languages, along with symbols, numbers, etc.

-> All I need to do is skip symbols such as arrows, hearts, etc.

-> I must not damage any characters belonging to the different languages.

Any leads?

----UPDATE----

Character.isLetter('\unicode') works for most characters, but not all. I checked it against my regional languages: it works for some of their characters, but not for every one.

Thanks.

Firefox asked Feb 24 '26 12:02


1 Answer

If I understand correctly, the characters you want to remove come from a rather limited set. Why not just check for those? Unicode has a whole bunch of non-letter characters, but in your case the ones you actually encounter will probably be a small subset of what exists.
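A minimal sketch of that "check for a known set" idea, assuming Java 9+ (for `Set.of`). The class name `KnownSymbolFilter` and the four code points in the blocklist are only illustrations picked from the arrows and hearts mentioned in the question; you would replace them with whatever actually shows up in your corpus. It walks the text by code point rather than by `char`, so characters outside the Basic Multilingual Plane are not split.

```java
import java.util.Set;

class KnownSymbolFilter {
    // Hypothetical blocklist: extend it with the symbols your corpus really contains.
    private static final Set<Integer> UNWANTED = Set.of(
            0x2190, // leftwards arrow
            0x2192, // rightwards arrow
            0x2665, // black heart suit
            0x2764  // heavy black heart
    );

    // Copies every code point that is not in the blocklist; everything else is untouched.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !UNWANTED.contains(cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Arrow and heart are dropped; the Devanagari word is left as it is.
        System.out.println(strip("I \u2665 Java \u2192 \u0939\u093F\u0928\u094D\u0926\u0940"));
    }
}
```

This only removes what you list, so it cannot spoil any language's characters, but it will also miss any symbol you did not think of.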

Sounds like a job for regular expressions, if you ask me. Remove everything that's not a word character, digit or whitespace, and you've probably got it. Or create an array containing all characters you want filtered out (which in that case should be few and known).
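A rough sketch of the regular-expression route, with some assumptions. In Java, `\w` only matches ASCII by default, so this version uses Unicode categories instead: `\p{L}` (letters), `\p{M}` (combining marks), `\p{N}` (numbers) plus `\s`. Keeping `\p{M}` is my guess at why `Character.isLetter` misses some characters in regional languages: in many scripts the vowel signs are marks, not letters. The class name `SymbolStripper` is made up for the example.

```java
import java.util.regex.Pattern;

class SymbolStripper {
    // Matches anything that is NOT a letter, combining mark, number or whitespace.
    private static final Pattern NOT_TEXT =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\s]");

    // Deletes every matching code point; letters of any script pass through unchanged.
    static String strip(String text) {
        return NOT_TEXT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("I \u2665 Java \u2192 \u0939\u093F\u0928\u094D\u0926\u0940"));
    }
}
```

Note that this also drops punctuation; add `\p{P}` to the character class if you want to keep it. Format characters such as zero-width joiners are removed as well, which may matter for some scripts, so test it on a sample of your own data first.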

Arne answered Feb 27 '26 00:02