Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

removing noise from document images

I'm working on a project to automatically process scanned invoices. In order get a better result the OCR engine, I'd like to first remove noise from images. Beside scratches, I'd also like to remove anything that was added to document after it was printed. Many invoice e.g. were ticked off and sometimes it makes parts of the invoice unreadable for OCR.

For example have a look at this image. The description of the second item won't be readable and I'd like to remove "noise" like that.

So how can I remove handwritten regions like that and still maintain a high quality of the printed text beneath?

like image 313
Pedro Avatar asked Nov 16 '11 20:11

Pedro


2 Answers

Scratches and other spots can fairly easily be filtered by just ignoring any pixels that are not of at least a certain colour intensity.

You have three options for dealing with the lines:

  1. First important question, are the hand writing written in a different colour? A simple solution would be to give everyone blue or red pens and ban black pens from being used. You can then scan the documents in colour, then you can easily just use the Green buffer as your grey-scale image instead of all three buffer. That would be the easiest way to implement this, almost every scanner now days supports colour scanning.

  2. Otherwise you're going to have to write an algorithm that can detect lines within an image, for this to work, you would need to first calibrate the algorithm to first know what is the size of a character usually, then find any lines that are longer than X pixels, then remove the line from there. This is going to be very problematic and not going to work too well for you, and you will spend a long time trying to get it working and it still will never be 100%.

  3. The other way is that after doing your OCR you should present your data to an end user to verify it's correct, you could then present them with the scanned image and allow them to overwrite what was scanned if it was incorrect.

Of the three options I would say your best option is just to prevent people from writing on invoices with a black pen. If you can't do that, then scan the document best you can and provide it to an end user to clarify the problematic fields (you could even flag them as being a problem so the user doesn't need to check the whole document all the time).

Edit: one thing that is worth pointing out, is that if you are receiving documents that have been written on and then faxed, you're not going to be able to do very much with them other than option 3 (try your best and then present to a user).

like image 97
Seph Avatar answered Nov 04 '22 19:11

Seph


This is a complicated signal processing task and would require a sophisticated algorithm which makes use of some of the qualities that distinguish handwritten notes from printed text (for example, the width of the marks, the curvature of handwritten notes as compared to printed text, or maybe even the shade of ink).

Probably more information than you're looking for, but you could even train a learning algorithm to filter out unwanted marks.

like image 34
mattgately Avatar answered Nov 04 '22 19:11

mattgately