Given a database table holding a huge amount of data, what is the best practice for removing noise text, such as meaningless, randomly typed strings?
The noise is stored in the "name" field.
I'm working on the data with standard Java data structures.
Removing stuff like that isn't as easy as it might seem.
For us humans, it's easy to see that "djkhfkjh" doesn't make any sense. But how would a computer detect this kind of noise? How would it know whether "Eyjafjallajökull" is just someone smashing their keyboard or the most overhyped mountain of the last couple of years?
You can't do this reliably without many false positives, so in the end you are back to separating the false positives from the true positives by hand.
Well, you can build a classifier using NLP methods and train it on examples of noise and not-noise. One ready-made option is the language detector from Apache Tika: if the language detector says 'beats me', that might be good enough.
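To make the "train it on examples of noise and not-noise" idea concrete, here is a minimal sketch in plain Java. It is not Tika's API and not a production classifier: it is a toy naive Bayes model over character bigrams with add-one smoothing and equal class priors. The class name `NoiseClassifier`, its methods, and the handful of training strings are all made up for illustration; in practice you would train on a few thousand hand-labelled values from your own "name" column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier: naive Bayes over character bigrams.
class NoiseClassifier {
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private final Map<String, Integer> noiseCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int cleanTotal = 0;
    private int noiseTotal = 0;

    // Split a string into overlapping two-character pieces,
    // with '^' and '$' marking the start and end of the value.
    private static List<String> bigrams(String text) {
        String s = "^" + text.toLowerCase(Locale.ROOT) + "$";
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Add one labelled example to the model.
    public void train(String text, boolean isNoise) {
        for (String gram : bigrams(text)) {
            vocabulary.add(gram);
            if (isNoise) {
                noiseCounts.merge(gram, 1, Integer::sum);
                noiseTotal++;
            } else {
                cleanCounts.merge(gram, 1, Integer::sum);
                cleanTotal++;
            }
        }
    }

    // Log-probability of the text under one class, with add-one smoothing
    // so that unseen bigrams do not produce log(0).
    private double logLikelihood(String text, Map<String, Integer> counts, int total) {
        double sum = 0.0;
        for (String gram : bigrams(text)) {
            int count = counts.getOrDefault(gram, 0);
            sum += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return sum;
    }

    // Positive score: the text looks more like the noise examples.
    public double noiseScore(String text) {
        return logLikelihood(text, noiseCounts, noiseTotal)
             - logLikelihood(text, cleanCounts, cleanTotal);
    }

    public boolean isNoise(String text) {
        return noiseScore(text) > 0.0;
    }

    // Tiny made-up training set, only to show the mechanics.
    public static NoiseClassifier trainedOnToyExamples() {
        NoiseClassifier c = new NoiseClassifier();
        for (String s : new String[] {"john", "maria", "peter", "anna",
                                      "michael", "sarah", "david", "laura"}) {
            c.train(s, false);
        }
        for (String s : new String[] {"djkhfkjh", "asdfgh", "qwzxkj",
                                      "hjklfd", "kjhgfd", "xzcvbn"}) {
            c.train(s, true);
        }
        return c;
    }
}
```

With `NoiseClassifier c = NoiseClassifier.trainedOnToyExamples();`, calling `c.isNoise("djkhfkjh")` returns `true` and `c.isNoise("maria")` returns `false`. A legitimate value that looks nothing like the clean training examples can still be misclassified, so it is probably safer to use `noiseScore` to rank rows for manual review than to delete anything automatically.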