
How to deskew a scanned text page with ImageMagick?

I have documents that weren't scanned perfectly straight, so the text is not oriented horizontally: each line slopes by perhaps 10°.

My understanding is that the deskew option in ImageMagick should solve this, for example:

convert skewed_1500.jpeg -deskew 40% skewed_1500_not.jpg

but it doesn't have any noticeable effect on the output file.

I've attached the skewed and deskewed images for comparison.

First, the original image: (image: skewed image)

Then the purportedly deskewed image: (image: deskewed image)

carbontracking asked Jan 09 '17 10:01


2 Answers

I would try a bigger value, like 80%. Otherwise, an ImageMagick forum member has a bash script that may work better: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php
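To compare thresholds side by side, something like the following sketch could help. It assumes ImageMagick 6's convert is on the PATH (on ImageMagick 7 the command is magick). The helper names run, deskew_variants and deskew_angle and the output file naming are made up for illustration. As far as I know, ImageMagick exposes the detected rotation as the deskew:angle property, which is a quick way to see whether -deskew found any skew at all.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Write one output file per threshold so the results can be compared by eye.
# The thresholds 40, 60 and 80 are arbitrary sample values.
deskew_variants() {
    input="$1"
    base="${input%.*}"
    for threshold in 40 60 80; do
        output="${base}_deskew${threshold}.jpg"
        run convert "$input" -deskew "${threshold}%" "$output"
    done
}

# Print the rotation angle that -deskew detected (assumed property name:
# deskew:angle). An angle near 0 suggests no skew was recognised.
deskew_angle() {
    run convert "$1" -deskew 40% -format '%[deskew:angle]\n' info:
}

# Example usage:
#   DRY_RUN=1 deskew_variants skewed_1500.jpeg   # only show the commands
#   deskew_variants skewed_1500.jpeg             # actually run them
#   deskew_angle skewed_1500.jpeg
```

If the reported angle stays near zero even at higher thresholds, the script linked above or a different tool is probably the better route.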

Bonzo answered Oct 24 '22 04:10


With OCRmyPDF

You can also straighten the pages by first having ImageMagick convert your JPG to PDF (convert input.jpg input.pdf) and then letting OCRmyPDF deskew the PDF:

ocrmypdf --deskew --tesseract-timeout=0 input.pdf output.pdf

Using your example page, I'd say the resulting text is straight:

(image: straightened page, after running OCRmyPDF)

As documented here, --tesseract-timeout=0 disables optical character recognition.

Of course you can also deskew the PDF and make it searchable in one go:

ocrmypdf --deskew -l fra input.pdf output.pdf

The -l fra option selects French, so make sure the French language pack for Tesseract is installed before running this. Here are instructions.
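The two steps (JPG to PDF, then deskew) can be wrapped in a small function. This is only a sketch: straighten_scan and run are hypothetical helper names, the _deskewed.pdf suffix is my own choice, and it assumes convert and ocrmypdf are both installed. Passing a Tesseract language code as the second argument switches from deskew-only to deskew plus OCR.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: straighten_scan input.jpg [tesseract-language-code]
straighten_scan() {
    jpg="$1"
    lang="${2:-}"
    stem="${jpg%.*}"

    if [ -n "$lang" ]; then
        # Deskew and make the PDF searchable in one go.
        ocr_opts="-l $lang"
    else
        # Deskew only: a timeout of 0 disables OCR.
        ocr_opts="--tesseract-timeout=0"
    fi

    # Step 1: let ImageMagick wrap the JPG in a PDF.
    run convert "$jpg" "${stem}.pdf"
    # Step 2: let OCRmyPDF straighten the page.
    # $ocr_opts is left unquoted on purpose so "-l fra" splits into two words.
    run ocrmypdf --deskew $ocr_opts "${stem}.pdf" "${stem}_deskewed.pdf"
}

# Example usage:
#   straighten_scan input.jpg        # deskew only
#   straighten_scan input.jpg fra    # deskew and OCR in French
```

Setting DRY_RUN=1 first prints the commands without running them, which is handy for checking file names before processing a whole folder of scans.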

Crop the PDF

To get rid of the black parts on the sides and the white part on the bottom of the PDF, you can use pdfcrop (commonly part of TeX Live):

# Remove margins at left, top, right, and bottom
pdfcrop --margins '-60 0 -50 -430' output.pdf cropped_output.pdf

The cropped and deskewed PDF:

(image: PDF cropped with pdfcrop)
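Since the four numbers are easy to mix up, a thin wrapper that names each margin may be useful. This is a sketch with hypothetical helper names (crop_pdf, run); it assumes pdfcrop is installed and that the values are in PostScript points (bp), which is what pdfcrop's help text states as far as I remember. The values from the command above fit this particular scan and will need adjusting for other pages.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
# Note: the dry-run output drops the quotes around the margins argument.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: crop_pdf SRC DST LEFT TOP RIGHT BOTTOM
# pdfcrop takes the margins as one quoted string in the order
# "left top right bottom"; negative values trim the page on that side.
crop_pdf() {
    src="$1"
    dst="$2"
    left="$3"
    top="$4"
    right="$5"
    bottom="$6"
    run pdfcrop --margins "$left $top $right $bottom" "$src" "$dst"
}

# Example usage, equivalent to the command above:
#   crop_pdf output.pdf cropped_output.pdf -60 0 -50 -430
```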

Matthias Braun answered Oct 24 '22 05:10