
Detect and remove noise text [closed]

Tags: java, text, noise

Given a database table holding a huge amount of data, what is the best practice for removing noise text such as:

  • fghfghfghfg
  • qsdqsdqsd
  • rtyrtyrty

The noise is stored in the "name" field.

I'm working on the data with standard Java data structures.

Youssef asked May 13 '10 13:05


2 Answers

Removing stuff like that isn't as easy as it might seem.

For us humans, it's easy to see that "djkhfkjh" doesn't make any sense. But how would a computer detect this kind of noise? How would it know whether "Eyjafjallajökull" is just someone smashing their keyboard or the most overhyped mountain of the last couple of years?

You can't do this reliably without many false positives, so in the end you are back to separating the false positives from the true positives by hand.
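To make the trade-off concrete, here is a minimal sketch of one narrow heuristic in plain Java. It only flags strings that are a short unit repeated over and over, which happens to be the shape of the three examples in the question. The class name, method name, and both thresholds are assumptions made up for illustration, and keyboard mashing without a repeating pattern (such as "djkhfkjh") slips straight through, so the output would still need a manual review.

```java
import java.util.Locale;

final class RepeatNoiseHeuristic {

    /**
     * Hypothetical helper: returns true if the text is a unit of at most
     * maxUnit characters repeated at least minRepeats times. A truncated
     * final repetition is allowed, e.g. "fghfghfghfg" = "fgh" x 3 + "fg".
     */
    static boolean looksLikeRepeatedUnit(String text, int maxUnit, int minRepeats) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        for (int unit = 1; unit <= maxUnit; unit++) {
            // Longer units need even more characters, so stop here.
            if (t.length() < unit * minRepeats) {
                break;
            }
            // The text has period `unit` if every character equals
            // the character `unit` positions before it.
            boolean periodic = true;
            for (int i = unit; i < t.length(); i++) {
                if (t.charAt(i) != t.charAt(i - unit)) {
                    periodic = false;
                    break;
                }
            }
            if (periodic) {
                return true;
            }
        }
        return false;
    }
}
```

Called as `RepeatNoiseHeuristic.looksLikeRepeatedUnit(name, 4, 3)`, this flags the question's three samples and leaves "Eyjafjallajökull" alone, but a real name that happens to be periodic would be flagged as well. That is the false-positive problem described above.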

LukeN answered Nov 15 '22 06:11


Well, you can build a classifier using NLP methods and train it on examples of noise and not-noise. One ready-made option you could try is the language detector from Apache Tika. If the language detector says "beats me", that might be good enough.
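As a rough illustration of the "train it on examples" idea, the sketch below is a toy naive Bayes classifier over character bigrams, written with the Java standard library only. It is not Tika's detector and does not use any Tika API. The class name, the add-one smoothing, and the equal-priors decision rule are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BigramNoiseClassifier {
    private final Map<String, Integer> noiseCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private int noiseTotal = 0;
    private int cleanTotal = 0;

    /** Adds labelled examples: isNoise = true for noise, false for real names. */
    void train(List<String> examples, boolean isNoise) {
        for (String example : examples) {
            for (String bigram : bigrams(example)) {
                if (isNoise) {
                    noiseCounts.merge(bigram, 1, Integer::sum);
                    noiseTotal++;
                } else {
                    cleanCounts.merge(bigram, 1, Integer::sum);
                    cleanTotal++;
                }
            }
        }
    }

    /** True if the bigrams of the text are more likely under the noise model. */
    boolean isNoise(String text) {
        // Vocabulary size for add-one smoothing, plus one slot for unseen bigrams.
        Set<String> vocabulary = new HashSet<>(noiseCounts.keySet());
        vocabulary.addAll(cleanCounts.keySet());
        double v = vocabulary.size() + 1;

        // Sum of per-bigram log-likelihood ratios (noise vs. clean), equal priors.
        double score = 0.0;
        for (String bigram : bigrams(text)) {
            double pNoise = (noiseCounts.getOrDefault(bigram, 0) + 1) / (noiseTotal + v);
            double pClean = (cleanCounts.getOrDefault(bigram, 0) + 1) / (cleanTotal + v);
            score += Math.log(pNoise) - Math.log(pClean);
        }
        return score > 0;
    }

    /** Splits lower-cased text into overlapping two-character substrings. */
    private static List<String> bigrams(String text) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= t.length(); i++) {
            result.add(t.substring(i, i + 2));
        }
        return result;
    }
}
```

In use, you would call `train(noiseSamples, true)` and `train(realNames, false)` with hand-labelled rows from the "name" column, then run `isNoise(name)` over the rest. With only a handful of training examples the result is unreliable for anything it has not seen, so treat the flagged rows as candidates for review rather than as a final verdict.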

bmargulies answered Nov 15 '22 07:11