I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?
iTextSharp is also quite capable of extracting attachments, though you might have to use the low-level objects to do so.
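There are two ways to embed files in a PDF: in a file attachment annotation on a page, or at the document level in the "EmbeddedFiles" name tree.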
Once you have a file specification dictionary from either source, the file itself will be a stream inside that dictionary's "EF" (embedded file) entry, stored under the "F" key or, in newer files, "UF".
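So to list all the files at the document level, one would write code such as the following (in Java; iTextSharp exposes the same classes with .NET-style member names, e.g. GetAsDict instead of getAsDict):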
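// Imports assume iText 5 package names (com.itextpdf.text.pdf).
import com.itextpdf.text.pdf.*;
import java.util.HashMap;
import java.util.Map;

Map<String, byte[]> files = new HashMap<String, byte[]>();
PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
PdfDictionary embeddedFilesDict = (names == null) ? null : names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
// Only a flat name tree is read here; a tree split into /Kids nodes has no /Names array at its root.
PdfArray embeddedFiles = (embeddedFilesDict == null) ? null : embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null
int len = (embeddedFiles == null) ? 0 : embeddedFiles.size();
// Entries alternate: name, file specification, name, file specification, ...
for (int i = 0; i + 1 < len; i += 2) {
    PdfString name = embeddedFiles.getAsString(i); // should always be present
    PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
    PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
    if (streams == null)
        continue; // file specification without an embedded stream
    PRStream stream;
    if (streams.contains(PdfName.UF))
        stream = (PRStream) streams.getAsStream(PdfName.UF);
    else
        stream = (PRStream) streams.getAsStream(PdfName.F); // default stream for backwards compatibility
    if (stream != null) {
        files.put(name.toUnicodeString(), PdfReader.getStreamBytes(stream));
    }
}
reader.close();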
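The question also asks how to read the XML once it is extracted. Assuming the loop above has filled a map of attachment names to raw bytes, a minimal sketch of that step using only the JDK's built-in DOM parser might look like this. The class name AttachmentXmlReader, the method parseXmlAttachments, and the sample file names are made up for illustration; in .NET, XmlDocument or XDocument would play the same role.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: not part of iText, just a sketch of the "read them" step.
public class AttachmentXmlReader {

    // Parses every attachment whose name ends in ".xml" and returns the
    // resulting DOM documents keyed by attachment name.
    public static Map<String, Document> parseXmlAttachments(Map<String, byte[]> files) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Attachments come from arbitrary PDFs, so refuse DOCTYPE declarations
        // to avoid external-entity (XXE) tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        Map<String, Document> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            if (!entry.getKey().toLowerCase(Locale.ROOT).endsWith(".xml")) {
                continue; // skip attachments that are not XML by name
            }
            parsed.put(entry.getKey(), builder.parse(new ByteArrayInputStream(entry.getValue())));
        }
        return parsed;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the map produced by the extraction loop.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("invoice.xml", "<invoice><total>42</total></invoice>".getBytes(StandardCharsets.UTF_8));
        files.put("logo.png", new byte[] {1, 2, 3});

        Map<String, Document> docs = parseXmlAttachments(files);
        System.out.println(docs.get("invoice.xml").getDocumentElement().getNodeName());
    }
}
```

Filtering by file extension is only a convenience here; if the attachments are not reliably named, one could instead try to parse every entry and skip those that raise a SAXException.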