
How do I extract attachments from a pdf file?

Tags:

c#

.net

pdf

I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?

gyurisc asked Jun 10 '11 11:06



1 Answer

iTextSharp is also quite capable of extracting attachments, though you might have to use the low-level objects to do so.

There are two ways to embed files in a PDF:

  1. In a file attachment annotation on a page.
  2. At the document level, in the "EmbeddedFiles" name tree.

Once you have a file specification dictionary from either source, the file itself will be a stream within the dictionary labeled "EF" (embedded file).

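So, to list all the files at the document level, you could write code like this (in Java; iTextSharp is the .NET port of iText and exposes the same objects with .NET-style names such as GetAsDict):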

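import com.itextpdf.text.pdf.*; // iText 5 package; iText 2.x uses com.lowagie.text.pdf
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PdfAttachments {
  // Collects the document-level attachments of a PDF, keyed by attachment name.
  static Map<String, byte[]> extractEmbeddedFiles(String pdfPath) throws IOException {
    Map<String, byte[]> files = new HashMap<String, byte[]>();
    PdfReader reader = new PdfReader(pdfPath);
    try {
      PdfDictionary root = reader.getCatalog();
      PdfDictionary names = root.getAsDict(PdfName.NAMES);
      if (names == null) return files; // no name dictionary, so no attachments
      PdfDictionary embeddedFilesDict = names.getAsDict(PdfName.EMBEDDEDFILES);
      if (embeddedFilesDict == null) return files;
      // Only a flat name tree is handled here; larger trees may keep their entries under /Kids.
      PdfArray embeddedFiles = embeddedFilesDict.getAsArray(PdfName.NAMES);
      if (embeddedFiles == null) return files;

      // The array alternates: name, file specification, name, file specification, ...
      for (int i = 0; i + 1 < embeddedFiles.size(); i += 2) {
        PdfString name = embeddedFiles.getAsString(i);
        PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1);
        if (name == null || fileSpec == null) continue;

        PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
        if (streams == null) continue;

        // Prefer the UF stream; F is the default stream for backwards compatibility.
        PdfStream stream = streams.contains(PdfName.UF)
            ? streams.getAsStream(PdfName.UF) : streams.getAsStream(PdfName.F);
        if (stream instanceof PRStream) {
          files.put(name.toUnicodeString(), PdfReader.getStreamBytes((PRStream) stream));
        }
      }
    } finally {
      reader.close();
    }
    return files;
  }
}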
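
The question also asks about reading the attached XML once it is extracted. As a rough sketch, assuming the attachments have already been collected into a name-to-bytes map like `files` above, the bytes can be handed to the JDK's built-in DOM parser. The class `AttachmentXml` and the method `rootElementNames` are hypothetical names made up for this illustration; they are not part of iText. Filtering on a ".xml" suffix is also just an assumption about how the attachments are named.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not part of iText): parses extracted attachment bytes as XML.
final class AttachmentXml {
  private AttachmentXml() {}

  // Returns the root element name of every attachment whose name ends in ".xml".
  static Map<String, String> rootElementNames(Map<String, byte[]> files) {
    Map<String, String> roots = new LinkedHashMap<>();
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      // Attachments are untrusted input, so refuse DOCTYPE declarations.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      DocumentBuilder builder = factory.newDocumentBuilder();

      for (Map.Entry<String, byte[]> entry : files.entrySet()) {
        String name = entry.getKey();
        if (!name.toLowerCase(Locale.ROOT).endsWith(".xml")) {
          continue; // skip attachments that do not look like XML
        }
        Document doc = builder.parse(new ByteArrayInputStream(entry.getValue()));
        roots.put(name, doc.getDocumentElement().getNodeName());
      }
    } catch (Exception e) {
      // Wrap parser configuration, I/O and syntax errors in one unchecked exception.
      throw new IllegalStateException("Could not parse an XML attachment", e);
    }
    return roots;
  }
}
```

From here the parsed `Document` can be walked with the usual DOM calls instead of only reading the root element name. In .NET the equivalent step would load the bytes into an XML document type of your choice.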

Mark Storer answered Oct 10 '22 16:10