
Regex to replace gibberish

Tags: regex

I need to clean up OCR output in which handwriting has been recognized as gibberish. Any suggestions for a regex to strip out the random characters? Example:


Federal prosecutors on Monday charged a Miami man with the largest 
case of credit and debit card data theft ever in the United States, 
accusing the one-time government informant of swiping 130 million 
accounts on top of 40 million he stole previously.

, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"

Gonzalez is a former informant for the U.S. Secret Service who helped 
the agency hunt hackers, authorities say. The agency later found out that 
he had also been working with criminals and feeding them information 
on ongoing investigations, even warning off at least one individual, 
according to authorities.

eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
JoshB asked Aug 18 '09 03:08


2 Answers

One of the simplest solutions (not involving regexes):

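# Python; `line` and `line_is_garbage()` are placeholders
import string

# Count the punctuation characters in the line
number_of_punct = sum(1 for c in line if c in string.punctuation)

# Treat the line as garbage when more than half of it is punctuation
if number_of_punct > len(line) / 2:
    line_is_garbage()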

Or, as a crude regex substitution: s/[!,'"@#~$%^& ]{5,}//g
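
The substitution above is written in sed/Perl style. A minimal Python sketch of the same idea follows, assuming the character class from the answer is used unchanged. The function name `strip_symbol_runs` is hypothetical and not part of the answer.

```python
import re

# Same character class as in the answer: the listed punctuation plus the
# space, matched only in runs of five or more characters.
SYMBOL_RUN = re.compile(r"""[!,'"@#~$%^& ]{5,}""")


def strip_symbol_runs(text):
    """Remove every run of 5+ characters drawn from the class above."""
    return SYMBOL_RUN.sub("", text)
```

Note that the class does not include `.`, `:` or `;`, so a line such as `':... 11'1` from the question would pass through untouched. Because the space is part of the class, a comma followed by four or more spaces in ordinary text would also be removed. Extending or narrowing the class is a tuning decision, not something the answer specifies.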

maykeye answered Oct 13 '22 01:10


A simple heuristic, similar to the anonymous answer:

listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumeric symbols
listB = [!@$%^&...] // other symbols

Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )

// check Nb first to avoid dividing by zero on lines with no other symbols
if Nb > 0 and Na/Nb <= garbage_ratio then
  // garbage
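
The ratio test above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the answerer's code: whitespace is counted in neither group, a line with no "other" symbols is never flagged, and the default `garbage_ratio` of 1.0 is an arbitrary starting point to tune against real data. The names `is_garbage` and `drop_garbage_lines` are hypothetical.

```python
def is_garbage(line, garbage_ratio=1.0):
    """Return True when alphanumerics are scarce relative to other symbols."""
    # Na: letters and digits. Nb: everything else except whitespace
    # (ignoring whitespace is an assumption, not stated in the answer).
    na = sum(1 for c in line if c.isalnum())
    nb = sum(1 for c in line if not c.isalnum() and not c.isspace())
    if nb == 0:
        # No other symbols at all: nothing to compare against, keep the line.
        return False
    return na / nb <= garbage_ratio


def drop_garbage_lines(text, garbage_ratio=1.0):
    """Keep only the lines that the ratio test does not flag."""
    kept = [ln for ln in text.splitlines() if not is_garbage(ln, garbage_ratio)]
    return "\n".join(kept)
```

With these assumptions, a line like `. : I` from the question is flagged (one alphanumeric against two symbols), while short all-letter residue such as `Pl` is not, so the heuristic would likely need to be combined with another check, for example a minimum line length.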
Nick Dandoulakis answered Oct 13 '22 03:10