itext get content size

I just spent a few hours scouring the web. It seems others also have this issue, but I couldn't find an answer.

I have a whole bunch of PDF files whose measurements I need, namely the height and width of each page's content.

In Adobe Illustrator, when you import a PDF you have the option of trimming to the "bounding box". That's exactly what I need.

I tried many approaches, here's the hodgepodge:

Dim pdfStream = IO.File.OpenRead(FilePath)
Dim img = PdfImages(pdfStream)
Dim pdfReader = New PdfReader(pdfStream)
Dim pdfDictionary = pdfReader.GetPageN(1)
Dim mediaBox = pdfDictionary.GetAsArray(PdfName.MEDIABOX)
Dim b = pdfReader.GetPageSize(pdfDictionary)
Dim ms = New MemoryStream
Dim document = New Document(pdfReader.GetPageSizeWithRotation(1))
Dim writer = PdfWriter.GetInstance(document, ms)
document.Open()
document.SetPageSize(pdfReader.GetPageSize(1))
document.NewPage()
Dim cb = writer.DirectContent
cb.Clip()
Dim pageImport = writer.GetImportedPage(pdfReader, 1)
pdfReader.Close()
pdfStream.Close()

All I manage to get is the page size, which is useless. I tried this on a whole bunch of PDFs, so it's not like one corrupt file or something.

Yisroel M. Olewski asked Jul 26 '13 10:07


1 Answer

To achieve your goal,

trimming to the "bounding box". That's exactly what I need

you actually have to solve two problems:

  1. You have to change the crop boxes of the individual pages of some PDF document.
  2. You have to determine the bounding box of some page, i.e. (as I assume) the smallest box (with horizontal and vertical sides) containing all visible content of a page.

Ad 1) change the crop boxes of the individual pages

You should not use the code you found for that task. Manipulating a single document is almost always best done using a PdfStamper, not a PdfWriter.

The iText in Action — 2nd Edition sample CropPages.java / CropPages.cs shows how to do that. The central method:

public byte[] ManipulatePdf(byte[] src)
{
  PdfReader reader = new PdfReader(src);
  int n = reader.NumberOfPages;
  PdfDictionary pageDict;
  PdfRectangle rect = new PdfRectangle(55, 76, 560, 816);
  for (int i = 1; i <= n; i++)
  {
    pageDict = reader.GetPageN(i);
    pageDict.Put(PdfName.CROPBOX, rect);
  }
  using (MemoryStream ms = new MemoryStream())
  {
    using (PdfStamper stamper = new PdfStamper(reader, ms))
    {
      // Nothing to do here: closing the stamper writes the modified pages to ms.
    }
    return ms.ToArray();
  }
}

(The code works in memory, i.e. expects a byte[] and returns one, but can easily be revised to work in the file system.)

As you can see, you actually manipulate the PDF as present in the PdfReader and then only use the PdfStamper to store the changed PDF.

In your case, though, there is no fixed rectangle for all pages but instead you have to determine the rectangle for each page...

Ad 2) determine the bounding box of some page

To determine the bounding box you actually have to parse the whole page content and determine the dimensions of each drawn element.
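To illustrate only the last part of that (this is not iText code): once each drawn element has been reduced to a rectangle, the page's bounding box is the union of those rectangles. The Box type and the BoundingBox.Union helper below are hypothetical names made up for this sketch, and the coordinates in the usage example are invented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical axis-aligned rectangle given by its lower-left and upper-right corners.
public struct Box
{
    public float Llx, Lly, Urx, Ury;

    public Box(float llx, float lly, float urx, float ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }

    public float Width { get { return Urx - Llx; } }
    public float Height { get { return Ury - Lly; } }
}

public static class BoundingBox
{
    // Returns the smallest axis-aligned box containing all element boxes,
    // or null if there are no elements at all.
    public static Box? Union(IEnumerable<Box> elements)
    {
        Box? result = null;
        foreach (Box e in elements)
        {
            if (result == null)
            {
                result = e;
            }
            else
            {
                Box r = result.Value;
                result = new Box(
                    Math.Min(r.Llx, e.Llx), Math.Min(r.Lly, e.Lly),
                    Math.Max(r.Urx, e.Urx), Math.Max(r.Ury, e.Ury));
            }
        }
        return result;
    }
}
```

For example, the union of new Box(72, 100, 300, 700) and new Box(50, 120, 280, 760) spans from (50, 100) to (300, 760). The hard part in practice is producing those per-element rectangles, which is what the content parsing discussed next is for.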

Unfortunately iText(Sharp) supports this in a comfortable manner only up to a certain degree: It provides a content parsing framework, but this framework does not yet handle vector graphics out of the box.

The iText in Action — 2nd Edition sample ShowTextMargins.java / ShowTextMargins.cs shows how you can use that framework to determine the cropbox (vector graphics ignored). The essential code:

PdfReaderContentParser parser = new PdfReaderContentParser(reader);
[...]
TextMarginFinder finder = parser.ProcessContent(i, new TextMarginFinder());

After that ProcessContent call, the finder provides the coordinates of the lower-left and upper-right corners of the bounding box of page i (vector graphics ignored) via finder.GetLlx(), finder.GetLly(), finder.GetUrx(), and finder.GetUry(). You can use these values to construct the rectangle to pass to pageDict.Put(PdfName.CROPBOX, rect) in the code above.
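Putting both parts together might look roughly like the following. Take it as a sketch, not tested code: it assumes iTextSharp 5.x, the class and method names (TextBoundsCropper, CropToTextBounds) are made up here, and it assumes every page contains at least some text. As far as I recall, the finder has no rectangle to report for a page without text, so such pages would need separate handling.

```csharp
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class TextBoundsCropper
{
    // Hypothetical helper: sets each page's crop box to the bounding box
    // of its text (vector graphics are ignored by TextMarginFinder).
    public static byte[] CropToTextBounds(byte[] src)
    {
        PdfReader reader = new PdfReader(src);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Assumption: page i contains text; see the note above.
            TextMarginFinder finder = parser.ProcessContent(i, new TextMarginFinder());

            // The content size asked about in the question.
            float width = finder.GetUrx() - finder.GetLlx();
            float height = finder.GetUry() - finder.GetLly();
            System.Console.WriteLine("Page {0}: {1} x {2} pt", i, width, height);

            PdfRectangle rect = new PdfRectangle(
                finder.GetLlx(), finder.GetLly(),
                finder.GetUrx(), finder.GetUry());
            reader.GetPageN(i).Put(PdfName.CROPBOX, rect);
        }

        using (MemoryStream ms = new MemoryStream())
        {
            // Closing the stamper writes the modified pages to ms.
            using (PdfStamper stamper = new PdfStamper(reader, ms))
            {
            }
            return ms.ToArray();
        }
    }
}
```

If you only need the measurements and not a cropped file, the loop alone is enough and the PdfStamper part can be dropped.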

If you need to also take vector graphics into account, though, you'll have to extend the parser namespace classes somewhat to also create parsing events for vector graphics operators, and the TextMarginFinder to also take those events into account. For more on this read this answer.

mkl answered Jan 01 '23 12:01