How to find non-alphabets using Java

Question

I am processing a text corpus. It contains characters from several different languages, plus symbols, numbers, etc.

-> All I need to do is skip symbols such as arrows, hearts, etc.

-> I must not damage any characters that belong to the different languages.

Any leads?

----UPDATE----

Character.isLetter('\unicode') works for most characters, but not all. I checked text in my regional languages: it works for some characters but not for every one.
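One possible reason for the misses (an assumption, since the update does not say which characters fail) is that many scripts rely on combining marks, such as the vowel signs and virama in Devanagari. Those are Unicode category M, not L, so Character.isLetter returns false for them. Characters outside the Basic Multilingual Plane also need the int overload instead of the char one. A minimal sketch along those lines, with LetterFilter, isTextCodePoint and stripSymbols being hypothetical names:

```java
public class LetterFilter {

    // True for letters, combining marks, digits and whitespace.
    // Everything else (symbols, punctuation, controls) is rejected.
    static boolean isTextCodePoint(int cp) {
        if (Character.isLetter(cp) || Character.isDigit(cp) || Character.isWhitespace(cp)) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Walks the string by code point (not by char), so surrogate pairs
    // are classified as one character.
    static String stripSymbols(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(LetterFilter::isTextCodePoint)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Devanagari word, an arrow (U+2192), "world", a heart (U+2665), "42";
        // written with escapes so the source file stays ASCII.
        String sample = "\u0928\u092E\u0938\u094D\u0924\u0947 \u2192 world \u2665 42";
        System.out.println(stripSymbols(sample));
    }
}
```

With this sample the arrow and the heart are dropped while the Devanagari word, including its combining marks, stays intact. Note that this version also drops punctuation; if punctuation should survive, the check has to be widened accordingly.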

Thanks.

Arne · Accepted Answer

If I understand correctly, the characters you want to remove belong to a rather limited set. Why not just check for these? Unicode has a whole bunch of non-letter characters, but in your case the non-letter characters you actually encounter will probably be a small subset of what exists.

Sounds like a job for regular expressions, if you ask me. Remove everything that's not a word character, digit or whitespace, and you've probably got it. Or create an array containing all the characters you want filtered out (which in that case should be few and known).
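As a rough sketch of the regular-expression idea: in Java, \w matches only ASCII word characters by default, so the version below uses Unicode category classes instead (\p{L} letters, \p{M} combining marks, \p{N} numbers, \p{Z} separators). The class and method names are hypothetical, and the choice of categories is an assumption about what counts as "text" in the corpus. The second method is the narrower alternative that removes only the symbol categories (\p{S}) and leaves punctuation alone.

```java
import java.util.regex.Pattern;

public class RegexSymbolFilter {

    // Anything that is NOT a letter, combining mark, number or whitespace.
    private static final Pattern NOT_TEXT =
        Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{Z}\\s]");

    // Only characters in the Unicode symbol categories (Sm, Sc, Sk, So).
    private static final Pattern SYMBOLS = Pattern.compile("\\p{S}");

    // Keeps letters, marks, numbers and whitespace; drops the rest.
    static String stripSymbols(String input) {
        return NOT_TEXT.matcher(input).replaceAll("");
    }

    // Drops symbols such as arrows and hearts, keeps punctuation.
    static String stripOnlySymbols(String input) {
        return SYMBOLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // "a", arrow (U+2192), "b, c!", heart (U+2665)
        String sample = "a\u2192b, c!\u2665";
        System.out.println(stripSymbols(sample));
        System.out.println(stripOnlySymbols(sample));
    }
}
```

Be aware that \p{S} also covers a few ASCII characters such as + and $, so the narrower variant may still need an exception list for a given corpus.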

Tags:

java

character

special-characters

nlp
