I am processing a text corpus. It contains characters from several different languages, along with symbols, numbers, etc.
-> All I need to do is skip symbols such as arrows, hearts, etc.
-> I must not damage any characters that belong to the different languages.
Any leads?
----UPDATE----
Character.isLetter('\unicode') works for most characters, but not all. I have checked my regional languages: it works for some of their characters, but not for every one.
Thanks.
If I understand correctly, the characters you want to remove form a rather limited set. Why not just check for those? Unicode has a whole bunch of non-letter characters, but in your case the non-letter characters you actually encounter will probably be a small subset of what exists.
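A minimal sketch of that idea, assuming Java 9+ and a blocklist of three code points (rightwards arrow, heart suit, heavy heart). The class name `SymbolBlocklist` and the contents of `UNWANTED` are only illustrative; the real list would come from what actually appears in your corpus. It works on code points rather than `char` values so that characters outside the Basic Multilingual Plane are not split.

```java
import java.util.Set;

class SymbolBlocklist {
    // Hypothetical blocklist: U+2192 rightwards arrow, U+2665 heart suit, U+2764 heavy heart.
    static final Set<Integer> UNWANTED = Set.of(0x2192, 0x2665, 0x2764);

    // Copies every code point that is not on the blocklist; all other text is left untouched.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !UNWANTED.contains(cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a\u2192b \u2665 \u0928\u092E\u0938\u094D\u0924\u0947"));
    }
}
```

Anything not listed in the set passes through unchanged, so no language's characters are touched unless you add them yourself.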
Sounds like a job for regular expressions, if you ask me. Remove everything that is not a word character, a digit, or whitespace, and you have probably got it. Alternatively, create an array containing all the characters you want filtered out (which in that case should be few and known).
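Here is a rough sketch of the regex route, under a few assumptions of mine. In Java, `\w` only matches ASCII word characters by default, so this version uses Unicode property classes instead: `\p{L}` (letters), `\p{M}` (combining marks) and `\p{N}` (numbers). Including `\p{M}` matters for many scripts: a combining mark such as the Devanagari virama (U+094D) is not a letter as far as `Character.isLetter` is concerned, which may be why that check seemed to fail for some regional characters. The class name `LetterFilter` is just illustrative, and note that this pattern also drops punctuation, which may or may not be what you want.

```java
import java.util.regex.Pattern;

class LetterFilter {
    // Matches any character that is NOT a letter, combining mark, number, or ASCII whitespace.
    private static final Pattern NOT_TEXT = Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\s]");

    // Deletes every match, which removes symbols and punctuation but keeps text in any script.
    static String keepText(String text) {
        return NOT_TEXT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepText("a\u2192b \u2665 \u0928\u092E\u0938\u094D\u0924\u0947 42!"));
    }
}
```

If punctuation should survive, adding `\p{P}` to the character class is one option.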