I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest case of credit and debit card data theft ever in the United States, accusing the one-time government informant of swiping 130 million accounts on top of 40 million he stole previously. , ':, Ie ':... 11'1 . '(.. ~!' ': f I I . " .' I ~ I' ,11 l I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,. , i I ; J . I.' ,.\ ) .. . : I 'I', I .' ' r," Gonzalez is a former informant for the U.S. Secret Service who helped the agency hunt hackers, authorities say. The agency later found out that he had also been working with criminals and feeding them information on ongoing investigations, even warning off at least one individual, according to authorities. eh....l ~.\O ::t e;~~~ s: ~ ~. 0 qs c::; ~ g o t/J (Ii ., ::3 (1l Il:l ~ cil~ 0 2: t:lHj~(1l . ~ ~a 0~ ~ S' N ("b t/J :s Ot/JIl:l"-<:! v'g::!t:O -....c...... VI (:ll <' 0 := - ~ < (1l ::3 (1l ~ ' t/J VJ ~ Pl ..... .... (II
One of the simplest solutions (not involving regexes):
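# Python sketch: `line` is one line of OCR output; line_is_garbage() is whatever you do with a rejected line
import string
number_of_punct = sum(1 for c in line if c in string.punctuation)  # str has no ispunct() method
if number_of_punct > len(line) / 2: line_is_garbage()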
Or, as a crude regex: s/[!,'"@#~$%^& ]{5,}//g
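For anyone working in Python, the substitution above might look roughly like the sketch below. The character class is copied from the regex as posted (note that it includes the space). The function name `strip_noise_runs` is made up for illustration, and replacing each run with a single space rather than with nothing is my own assumption, so that the words on either side of a removed run do not get glued together.

```python
import re

# Character class taken verbatim from the substitution above: a run of
# five or more of these characters (space included) is treated as noise.
NOISE_RUN = re.compile(r"""[!,'"@#~$%^& ]{5,}""")

def strip_noise_runs(text):
    """Replace every run of 5+ noise characters with one space (assumption)."""
    return NOISE_RUN.sub(" ", text).strip()

if __name__ == "__main__":
    print(strip_noise_runs("informant !!,,~~ '' Gonzalez is a former informant"))
```

This only catches runs made up entirely of the listed characters. Gibberish that mixes in letters and digits, like much of the example in the question, will pass through, so it probably works best combined with a ratio test per line or per chunk.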
A simple heuristic, similar to the anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumeric symbols
listB = [!@$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na <= garbage_ratio * Nb then // same test as Na/Nb <= garbage_ratio, but safe when Nb = 0
    // garbage
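A minimal Python sketch of this ratio heuristic, under a few assumptions of mine: the function name `is_garbage` and the default `garbage_ratio=1.0` are placeholders to tune on real data, "alphanumeric" means whatever `str.isalnum()` accepts, and whitespace is counted in neither group.

```python
def is_garbage(line, garbage_ratio=1.0):
    """Return True when alphanumerics are scarce relative to other symbols."""
    # Na: letters and digits; Nb: everything else except whitespace.
    na = sum(1 for c in line if c.isalnum())
    nb = sum(1 for c in line if not c.isalnum() and not c.isspace())
    # Multiply instead of dividing so a line with no symbols (nb == 0)
    # does not raise ZeroDivisionError.
    return na <= garbage_ratio * nb

if __name__ == "__main__":
    for chunk in ["Gonzalez is a former informant.", "~ ~ :: .. ; ' ,"]:
        print(repr(chunk), is_garbage(chunk))
```

Note that an empty or whitespace-only line counts as garbage with this comparison. Since the example in the question has the gibberish inline with real sentences, the text would first need to be split into lines or fixed-size chunks before applying the test.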