I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned at 200dpi, some at 300dpi, and some at 600dpi. Since DjVu is a pixel-based format, I want to be sure I use the same resolution in the target DjVu file as was used for the scan.
Does anyone know what program I can run, or how I can write a program, to determine what resolution was used to produce a scanned PDF? (Number of pixels might work too as almost all documents are 8.5 by 11 inches.)
Clarification after responses: I'm aware of the difficulties highlighted by Breton, and I'm willing to concede that the problem in general is ill-posed, but I'm not asking about general PDF documents. My particular documents came out of a scanner. They contain one scanned image per page, same resolution each page. If I convert the PDF to PostScript I can poke around by hand and find pixel dimensions easily; I could probably find image sizes with more work. And if in desperate need I could modify the dictionary stack that gs is using; long ago, I wrote an interpreter for PostScript Level 1.
All of that is what I'm trying to avoid.
Thanks to help received, I've posted an answer below. In outline:
Get the page size with identify, taking only the output for the first page, and understanding that the units will be PostScript points, of which there are 72 to an inch.
Extract the images with pdfimages.
Run identify on the extracted images; this time identify will give number of pixels.
Full answer with script is below. I'm using it in live fire and it works great. Thanks Harlequin for pdfimages and Spiffeah for the alert about multiple images per page (it's rare, but I've found some).
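For readers who want the arithmetic behind those steps, here is a minimal sketch in Python. It is not the script referred to above. The names estimate_dpi and snap_to_known are made up for illustration, and the area-based estimate assumes that the images on a page tile it without overlapping, which covers the rare multi-image pages.

```python
import math

POINTS_PER_INCH = 72.0  # PostScript points per inch


def estimate_dpi(page_pts, image_sizes_px):
    """Estimate the scan resolution of one page.

    page_pts: (width, height) of the page in points, e.g. as reported by identify.
    image_sizes_px: list of (width, height) pixel sizes of the images on that page.
    Assumption: the images cover the page exactly once (no overlap, no gaps).
    """
    page_w_in = page_pts[0] / POINTS_PER_INCH
    page_h_in = page_pts[1] / POINTS_PER_INCH
    total_pixels = sum(w * h for w, h in image_sizes_px)
    # Pixels per square inch, then a square root to get linear dots per inch.
    return math.sqrt(total_pixels / (page_w_in * page_h_in))


def snap_to_known(dpi, known=(200, 300, 600)):
    """Pick the closest of the resolutions the scanner is known to have used."""
    return min(known, key=lambda k: abs(k - dpi))


# One full-page image on a Letter page (612 x 792 points).
single = estimate_dpi((612, 792), [(5100, 6600)])
# The same page stored as two half-height strips.
strips = estimate_dpi((612, 792), [(5100, 3300), (5100, 3300)])
print(snap_to_known(single), snap_to_known(strips))
```

Snapping to 200, 300 or 600 is only sensible because those are the resolutions named in the question; for other collections the raw estimate is the safer thing to report.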
If a PDF has been created by scanning, then there should be only one image associated with each page. You can easily find the resolution of each page image by parsing the PDF with the iText (Java) or iTextSharp (the .NET port) libraries.
If you want to roll your own utility to do this, do something like the following in iTextSharp:
// Needs: using iTextSharp.text.pdf; (MessageBox comes from System.Windows.Forms)
PdfReader reader = new PdfReader(filename);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    PdfDictionary pg = reader.GetPageN(i);
    // Resources may be an indirect reference, or missing altogether.
    PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    if (res == null)
    {
        continue;
    }
    PdfDictionary xobjs = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobjs != null)
    {
        foreach (PdfName xObjectKey in xobjs.Keys)
        {
            PdfObject xobj = xobjs.Get(xObjectKey);
            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(xobj);
            PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
            // Only image XObjects carry pixel dimensions; this also skips a missing subtype.
            if (PdfName.IMAGE.Equals(subtype))
            {
                // Width and Height may themselves be indirect references.
                PdfNumber width = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.WIDTH));
                PdfNumber height = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.HEIGHT));
                MessageBox.Show("image on page [" + i + "] resolution=[" + width + "x" + height + "]");
            }
        }
    }
}
reader.Close();
Here for each page we read through each XObject of subtype Image and get the WIDTH and HEIGHT values. This will be the pixel resolution of the image that the scanner has embedded in the pdf.
Note that the scaling of this image to match the page size (that is, the size of the page as rendered in Acrobat: A4, Letter, etc.) is performed separately in the page content stream, which is represented as a subset of PostScript, and is much harder to find without parsing the PostScript.
Be aware that there are some scanners that will embed the scanned image as a grid of smaller images (for some kind of size optimization I assume). So if you see something like 50 small images popping up for each page, that could be why.
Hope this helps in some way if you have to roll your own utility.
pdfimages has a -list option that gives the height and width in pixels and also y-ppi and x-ppi.
pdfimages -list tmp.pdf
page   num type  width height color comp bpc enc   interp object ID x-ppi y-ppi  size ratio
-------------------------------------------------------------------------------------------
   1     0 image  3300   2550 gray     1   1 ccitt no        477  0   389   232  172K   17%
   2     1 image  3300   2550 gray     1   1 ccitt no          3  0   389   232  103K   10%
   3     2 image  3300   2550 gray     1   1 ccitt no          7  0   389   232  236K   23%
   4     3 image  3300   2550 gray     1   1 ccitt no         11  0   389   232  210K   20%
   5     4 image  3300   2550 gray     1   1 ccitt no         15  0   389   232  250K   24%
   6     5 image  3300   2550 gray     1   1 ccitt no         19  0   389   232  199K   19%
   7     6 image  3300   2550 gray     1   1 ccitt no         23  0   389   232  503K   49%
   8     7 image  3300   2550 gray     1   1 ccitt no         27  0   389   232  154K   15%
   9     8 image  3300   2550 gray     1   1 ccitt no         31  0   389   232 21.5K  2.1%
  10     9 image  3300   2550 gray     1   1 ccitt no         35  0   389   232  286K   28%
  11    10 image  3300   2550 gray     1   1 ccitt no         39  0   389   232 46.8K  4.6%
  12    11 image  3300   2550 gray     1   1 ccitt no         43  0   389   232 55.5K  5.4%
  13    12 image  3300   2550 gray     1   1 ccitt no         47  0   389   232 35.0K  3.4%
  14    13 image  3300   2550 gray     1   1 ccitt no         51  0   389   232 26.9K  2.6%
  15    14 image  3300   2550 gray     1   1 ccitt no         55  0   389   232 66.5K  6.5%
  16    15 image  3300   2550 gray     1   1 ccitt no         59  0   389   232 73.9K  7.2%
  17    16 image  3300   2550 gray     1   1 ccitt no         63  0   389   232 47.0K  4.6%
  18    17 image  3300   2550 gray     1   1 ccitt no         67  0   389   232 30.1K  2.9%
  19    18 image  3300   2550 gray     1   1 ccitt no         71  0   389   232 70.3K  6.8%
  20    19 image  3300   2550 gray     1   1 ccitt no         75  0   389   232 46.0K  4.5%
  21    20 image  3300   2550 gray     1   1 ccitt no         79  0   389   232 28.9K  2.8%
  22    21 image  3300   2550 gray     1   1 ccitt no         83  0   389   232 72.7K  7.1%
  23    22 image  3300   2550 gray     1   1 ccitt no         87  0   389   232 47.5K  4.6%
  24    23 image  3300   2550 gray     1   1 ccitt no         91  0   389   232 30.1K  2.9%
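If you want to consume that listing from a script rather than read it by eye, something like the following Python sketch could work. It assumes the 16-column layout shown in the listing above and that the ppi columns print as whole numbers; other pdfimages versions may print different columns, so treat the field positions as an assumption. The function names are made up for illustration.

```python
SAMPLE = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 3300 2550 gray 1 1 ccitt no 477 0 389 232 172K 17%
2 1 image 3300 2550 gray 1 1 ccitt no 3 0 389 232 103K 10%
"""


def parse_pdfimages_list(text):
    """Turn `pdfimages -list` output into a list of dicts, one per image."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 16 columns and start with a page number;
        # the header line and the dashed rule do not.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
            "x_ppi": int(fields[12]),
            "y_ppi": int(fields[13]),
        })
    return rows


def images_per_page(rows):
    """Count images on each page, to spot pages stored as several images."""
    counts = {}
    for row in rows:
        counts[row["page"]] = counts.get(row["page"], 0) + 1
    return counts


rows = parse_pdfimages_list(SAMPLE)
for row in rows:
    print(row["page"], row["width"], row["height"], row["x_ppi"], row["y_ppi"])
```

Splitting on whitespace means the parser does not care how the columns are aligned, only how many there are.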
I guess that the scans are included as images in the PDF, so you could use pdfimages to extract them first. Then, identify should be able to find the correct data.
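A rough Python sketch of that two-step idea, in case it helps. The helper names are hypothetical. It assumes pdfimages and ImageMagick's identify are on the PATH, that pdfimages writes files named prefix-NNN with some extension, and that -f 1 -l 1 limits extraction to the first page.

```python
import glob
import subprocess


def extract_cmd(pdf_path, prefix):
    """Command line that extracts the images of page 1 only."""
    return ["pdfimages", "-f", "1", "-l", "1", pdf_path, prefix]


def identify_cmd(image_path):
    """Command line that prints an image's pixel width and height."""
    return ["identify", "-format", "%w %h\n", image_path]


def first_page_image_sizes(pdf_path, prefix):
    """Return [(width, height), ...] in pixels for the images on page 1."""
    subprocess.run(extract_cmd(pdf_path, prefix), check=True)
    sizes = []
    # pdfimages is assumed to name its output prefix-000.ext, prefix-001.ext, ...
    for image_path in sorted(glob.glob(glob.escape(prefix) + "-*")):
        result = subprocess.run(identify_cmd(image_path), check=True,
                                capture_output=True, text=True)
        width, height = result.stdout.split()[:2]
        sizes.append((int(width), int(height)))
    return sizes

# Example call (needs a real PDF and both tools installed):
# first_page_image_sizes("scan.pdf", "/tmp/page1")
```

Running identify on the extracted image files, rather than on the PDF itself, is the point: the image files carry pixel dimensions, while the PDF page is measured in points.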
Too long to put into a comment, but neither ImageMagick nor GraphicsMagick is up to the job; every answer is wrong:
: nr@yorkie 1932 ; gm identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
: nr@yorkie 1933 ; identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf
x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792
: nr@yorkie 1934 ;
The correct parameters for this document are that each scanned page is 5100 pixels wide and 6600 pixels high, unsurprising since this was an 8.5-by-11 scanned at 600dpi. The output from ImageMagick is astoundingly unprofessional.
No downvotes because you were trying to be helpful, but *Magick don't work.
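To spell out the numbers: the w=612 h=792 that both tools report is the page size in PostScript points (72 to the inch), i.e. 8.5 by 11 inches, not a pixel count. Once the true pixel size is known from some other source, the resolution follows per axis. A small Python sketch, with a made-up function name:

```python
POINTS_PER_INCH = 72.0


def dpi_per_axis(pixels, points):
    """Resolution along each axis: pixel count divided by page size in inches.

    pixels: (width, height) of the scanned image in pixels.
    points: (width, height) of the page in PostScript points.
    """
    return tuple(px / (pt / POINTS_PER_INCH) for px, pt in zip(pixels, points))


# 5100 x 6600 pixels on a 612 x 792 point page
print(dpi_per_axis((5100, 6600), (612, 792)))
```

If the two axes disagree noticeably, the image is probably rotated or cropped relative to the page, and a single dpi figure should be treated with suspicion.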