I have documents that weren't scanned perfectly straight, so the text is not perfectly horizontal: each line slopes by perhaps 10°.
My understanding is that the -deskew option in ImageMagick should solve this, for example:
convert skewed_1500.jpeg -deskew 40% skewed_1500_not.jpg
but it doesn't have any noticeable effect on the output file.
I've attached the skewed and deskewed images for comparison.
First the original image:
Then the purportedly deskewed image:
I would try a bigger value, like 80%. Otherwise, an ImageMagick forum member has a bash script that may work better: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php
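If it is unclear which threshold helps, one option is to run the same command from the question with several values and compare the results by eye. This is only a sketch: the function name deskew_sweep and the output naming scheme are made up for illustration, and it assumes ImageMagick's convert is on the PATH.

```shell
# Hypothetical helper: run -deskew with several thresholds on one input.
# Usage: deskew_sweep skewed_1500.jpeg 40 60 80
deskew_sweep() {
  input=$1
  shift
  base=${input%.*}                      # input path without its extension
  for pct in "$@"; do
    # Same command as in the question, only the threshold changes
    convert "$input" -deskew "${pct}%" "${base}_deskew_${pct}.jpg"
  done
}
```

Each threshold produces its own file (for example skewed_1500_deskew_80.jpg), so the original is never overwritten.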
You can also straighten the pages by first having ImageMagick convert your JPG to PDF (convert input.jpg input.pdf) and then letting OCRmyPDF rectify the PDF:
ocrmypdf --deskew --tesseract-timeout=0 input.pdf output.pdf
Using your example page, I'd say the resulting text is straight:
According to the OCRmyPDF documentation, --tesseract-timeout=0 disables optical character recognition, so only the image processing (here, the deskew) is applied.
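The two steps (JPG to PDF, then deskew without OCR) can be wrapped in a small function. This is a minimal sketch: the name deskew_scan and the _deskewed.pdf suffix are assumptions for illustration, and both convert and ocrmypdf must be installed.

```shell
# Hypothetical helper: convert an image to PDF, then deskew it without OCR.
# Usage: deskew_scan input.jpg   (writes input.pdf and input_deskewed.pdf)
deskew_scan() {
  input=$1
  base=${input%.*}                      # input path without its extension
  # Only run ocrmypdf if the conversion to PDF succeeded
  convert "$input" "${base}.pdf" &&
    ocrmypdf --deskew --tesseract-timeout=0 "${base}.pdf" "${base}_deskewed.pdf"
}
```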
Of course you can also deskew the PDF and make it searchable in one go:
ocrmypdf --deskew -l fra input.pdf output.pdf
Make sure Tesseract's French language pack is installed before running this; the OCRmyPDF documentation has instructions for installing language packs.
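A quick way to check for the language pack is to look for fra in the output of tesseract --list-langs. The function name has_tess_lang below is made up for this sketch; stderr is merged into the pipe because some Tesseract versions print the list there.

```shell
# Hypothetical helper: succeed if the given Tesseract language is installed.
# Usage: has_tess_lang fra && ocrmypdf --deskew -l fra input.pdf output.pdf
has_tess_lang() {
  # -x matches whole lines only, -F treats the language code literally
  tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

On Debian or Ubuntu the French pack is typically packaged as tesseract-ocr-fra; other systems may name it differently.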
To get rid of the black parts on the sides and the white part on the bottom of the PDF, you can use pdfcrop (commonly part of TeX Live):
# Remove margins at left, top, right, and bottom
pdfcrop --margins '-60 0 -50 -430' output.pdf cropped_output.pdf
The cropped and deskewed PDF: