I just spent a few hours scouring the web. It seems others also have this issue, but I couldn't find an answer.
I have a whole bunch of PDF files whose measurements I need, namely the height and width of each page's content.
In Adobe Illustrator, when you import a PDF you have the option of trimming to the "bounding box". That's exactly what I need.
I tried many approaches; here's the hodgepodge:
Dim pdfStream = IO.File.OpenRead(FilePath)
Dim img = PdfImages(pdfStream)
Dim pdfReader = New PdfReader(pdfStream)
Dim pdfDictionary = pdfReader.GetPageN(1)
Dim mediaBox = pdfDictionary.GetAsArray(PdfName.MEDIABOX)
Dim b = pdfReader.GetPageSize(pdfDictionary)
Dim ms = New MemoryStream
Dim document = New Document(pdfReader.GetPageSizeWithRotation(1))
Dim writer = PdfWriter.GetInstance(document, ms)
document.Open()
document.SetPageSize(pdfReader.GetPageSize(1))
document.NewPage()
Dim cb = writer.DirectContent
cb.Clip()
Dim pageImport = writer.GetImportedPage(pdfReader, 1)
pdfReader.Close()
pdfStream.Close()
All I manage to get is the page size, which is useless. I tried this on a whole bunch of PDFs, so it's not like one corrupt file or something.
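To achieve your goal,
"trimming to the 'bounding box'. That's exactly what I need"
you actually have to solve two problems: 1) change the crop boxes of the individual pages, and 2) determine the bounding box of each page.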
Ad 1) change the crop boxes of the individual pages
You should not use the code you found for that task. Manipulating a single document is almost always best done using a PdfStamper, not a PdfWriter.
The iText in Action — 2nd Edition sample CropPages.java / CropPages.cs shows how to do that. The central method:
using System.IO;
using iTextSharp.text.pdf;

public byte[] ManipulatePdf(byte[] src)
{
    PdfReader reader = new PdfReader(src);
    int n = reader.NumberOfPages;
    PdfDictionary pageDict;
    // Fixed crop rectangle (llx, lly, urx, ury) applied to every page
    PdfRectangle rect = new PdfRectangle(55, 76, 560, 816);
    for (int i = 1; i <= n; i++)
    {
        pageDict = reader.GetPageN(i);
        pageDict.Put(PdfName.CROPBOX, rect);
    }
    using (MemoryStream ms = new MemoryStream())
    {
        // Disposing the (otherwise unused) stamper writes the modified document to ms
        using (PdfStamper stamper = new PdfStamper(reader, ms))
        {
        }
        return ms.ToArray();
    }
}
(The code works in memory, i.e. expects a byte[] and returns one, but can easily be revised to work in the file system.)
As you see, you actually manipulate the PDF as present in the PdfReader and then only use the PdfStamper to store the changed PDF.
In your case, though, there is no fixed rectangle for all pages but instead you have to determine the rectangle for each page...
Ad 2) determine the bounding box of some page
To determine the bounding box you actually have to parse the whole page content and determine the dimensions of each drawn element.
Unfortunately iText(Sharp) supports this in a comfortable manner only up to a certain degree: It provides a content parsing framework, but this framework does not yet handle vector graphics out of the box.
The iText in Action — 2nd Edition sample ShowTextMargins.java / ShowTextMargins.cs shows how you can use that framework to determine the cropbox (vector graphics ignored). The essential code:
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
[...]
TextMarginFinder finder = parser.ProcessContent(i, new TextMarginFinder());
After that ProcessContent execution, the finder provides the coordinates of the lower left and upper right corners of the bounding box of page i (vector graphics ignored) via finder.GetLlx(), finder.GetLly(), finder.GetUrx(), and finder.GetUry(). You can use these data to construct a rectangle with which to feed pageDict.Put(PdfName.CROPBOX, rect) in the code above.
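Since the question asked for the height and width of the content, the four corner values can also be turned into measurements directly. The following is only a minimal Java sketch (iText's samples exist in Java as well): the class ContentSize and its members are hypothetical names and not part of iText, the corner numbers are made up for illustration, and the conversion assumes the default PDF user space unit of 1/72 inch (a page with a UserUnit entry would scale differently).

```java
// Hypothetical helper, not part of iText: turns bounding box corners into sizes.
final class ContentSize {
    // Assumption: default PDF user space, where 1 unit = 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    final double widthPt;
    final double heightPt;

    // Corners are given as lower left (llx, lly) and upper right (urx, ury).
    ContentSize(double llx, double lly, double urx, double ury) {
        this.widthPt = urx - llx;
        this.heightPt = ury - lly;
    }

    // Converts a length in default user space units (points) to millimeters.
    static double toMillimeters(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Stand-ins for the values the finder would report; these numbers are made up.
        ContentSize size = new ContentSize(55, 76, 560, 816);
        System.out.printf("%.1f x %.1f pt%n", size.widthPt, size.heightPt);
        System.out.printf("%.1f x %.1f mm%n",
                toMillimeters(size.widthPt), toMillimeters(size.heightPt));
    }
}
```

With iTextSharp, the four constructor arguments would be the values returned by finder.GetLlx(), finder.GetLly(), finder.GetUrx(), and finder.GetUry() for the page in question.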
If you need to also take vector graphics into account, though, you'll have to extend the parser namespace classes somewhat to also create parsing events for vector graphics operators, and the TextMarginFinder to also take those events into account. For more on this read this answer.
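Whatever produces the events, the merging step itself is small: each drawn element contributes a rectangle, and the bounding box is the union of all of them. Here is a rough, dependency-free Java sketch of that step. BoundingBoxAccumulator is a hypothetical name, not an iText class, and it assumes the rectangles passed in have already been transformed into page coordinates (which for vector graphics means applying the current transformation matrix first).

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText: merges element rectangles into one bounding box.
final class BoundingBoxAccumulator {
    private float llx = Float.POSITIVE_INFINITY;
    private float lly = Float.POSITIVE_INFINITY;
    private float urx = Float.NEGATIVE_INFINITY;
    private float ury = Float.NEGATIVE_INFINITY;

    /** Grows the box to cover a rectangle given by any two opposite corners. */
    void include(float x1, float y1, float x2, float y2) {
        llx = Math.min(llx, Math.min(x1, x2));
        lly = Math.min(lly, Math.min(y1, y2));
        urx = Math.max(urx, Math.max(x1, x2));
        ury = Math.max(ury, Math.max(y1, y2));
    }

    /** True while nothing has been included, e.g. for a blank page. */
    boolean isEmpty() {
        return llx > urx || lly > ury;
    }

    /** Returns {llx, lly, urx, ury}; only meaningful when isEmpty() is false. */
    float[] corners() {
        return new float[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        BoundingBoxAccumulator box = new BoundingBoxAccumulator();
        box.include(100, 200, 300, 250); // e.g. a text chunk (made-up numbers)
        box.include(50, 400, 120, 420);  // e.g. the extent of a vector path
        System.out.println(Arrays.toString(box.corners()));
    }
}
```

A listener that handles both text and vector graphics events could call include(...) once per event and read corners() after the page has been processed. Checking isEmpty() first avoids writing a nonsensical crop box for blank pages.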