I merge two PDF files into one with PDFBox version 2. The first one has these fonts:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XXMGEM+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes     15  0
XXMGEM+ArialMT                       TrueType          WinAnsi          yes yes yes     19  0
XXMGEM+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
XXMGEM+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes     40  0
XXMGEM+ArialNarrow                   TrueType          WinAnsi          yes yes yes     44  0
And the second one:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
UNTWVR+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     25  0
UNTYID+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     26  0
UNTZUP+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
UNUBHB+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     28  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no      29  0
UNXPUH+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     50  0
UNXRGT+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     51  0
UNXSTF+ArialMT                       CID TrueType      Identity-H       yes yes yes     52  0
UNXUFR+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     53  0
After merging, the result contains these fonts:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    420  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    421  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    422  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    423  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no     424  0
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    425  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    426  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    427  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    428  0
SRWYVL+ArialMT                       CID TrueType      Identity-H       yes yes yes    429  0
SRXAHX+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    430  0
SRXBUJ+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    431  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    432  0
WDEGAT+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes    436  0
GSEDXU+ArialMT                       TrueType          WinAnsi          yes yes yes    437  0
Arial                                TrueType          WinAnsi          yes no  no     416  0
ZapfDingbats                         TrueType          WinAnsi          yes no  yes    419  0
ArialNarrow                          TrueType          WinAnsi          yes no  no     417  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    618  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    619  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    620  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    621  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    622  0
GSEDXU+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes    560  0
NVGLHQ+ArialNarrow                   TrueType          WinAnsi          yes yes yes    561  0
KWHHMM+ArialMT                       CID TrueType      Identity-H       yes yes yes    578  0
My code in Java:
final PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.setDestinationStream(outputStream);
pdfMerger.addSources(additionalPdfStreams);
pdfMerger.addSource(inputStreamPdDocument);
pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
The problem is that an API from a third-party vendor has trouble with these fonts. So: what am I doing wrong, and how can I remove the unused and duplicated fonts?
The "duplication" most likely comes from the document having multiple pages: each page has its own resource dictionary that lists the fonts it uses. If you iterate over the pages and collect the font names, a font that is used on more than one page shows up once per page, so you will see duplicates in the output.
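As a rough model of that effect, here is a minimal sketch in plain Java with no PDFBox calls. The class `PerPageFontModel`, its methods, and the two-page sample data are all hypothetical and only illustrate why a page-by-page listing repeats names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: each inner list stands for the font names
// found in one page's resources.
class PerPageFontModel {

    // Concatenates the per-page lists, as a page-by-page loop would.
    static List<String> listPerPage(List<List<String>> pages) {
        List<String> all = new ArrayList<>();
        for (List<String> pageFonts : pages) {
            all.addAll(pageFonts);
        }
        return all;
    }

    // Keeps the first occurrence of each name, preserving order.
    static Set<String> distinct(List<String> names) {
        return new LinkedHashSet<>(names);
    }

    public static void main(String[] args) {
        // Assumed sample: two pages that both use the same two fonts.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("UNTZUP+ArialMT", "UNUBHB+Arial-BoldMT"),
                Arrays.asList("UNTZUP+ArialMT", "UNUBHB+Arial-BoldMT"));

        List<String> perPage = listPerPage(pages);
        System.out.println(perPage.size() + " listed: " + perPage);
        System.out.println(distinct(perPage).size() + " distinct: " + distinct(perPage));
    }
}
```

With this sample the listing has four entries but only two distinct names, which is the same pattern a per-page font listing shows for a shared font.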
Something seems very wrong with the details in the question, though. Neither of the source files has a ZapfDingbats font, so how did it end up in the merged document?
First, I wrote a couple of helper methods:
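import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

// Merges two PDFs into a temporary file and returns its path.
static String mergePdfs(InputStream is1, InputStream is2) throws IOException {
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSource(is1);
    pdfMerger.addSource(is2);
    // Build the path with File so a separator is inserted if java.io.tmpdir has no trailing one.
    String destFile = new File(System.getProperty("java.io.tmpdir"), System.nanoTime() + ".pdf").getPath();
    pdfMerger.setDestinationFileName(destFile);
    pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    return destFile;
}

// Lists the fonts page by page, so a font used on several pages appears once per page.
static List<String> getFontNames(PDDocument doc) throws IOException {
    List<String> result = new ArrayList<>();
    for (PDPage page : doc.getPages()) {
        PDResources res = page.getResources();
        if (res == null) {
            continue; // page without a resource dictionary
        }
        for (COSName fontName : res.getFontNames()) {
            result.add(res.getFont(fontName).toString());
        }
    }
    return result;
}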
Then I created 3 test PDF documents. The first 2, test-pdf-1.pdf and test-pdf-2.pdf, contain one page each and use the same two fonts: PDTrueTypeFont BAAAAA+ArialMT and PDTrueTypeFont CAAAAA+Roboto-Black. The 3rd one, test-pdf-3.pdf, contains the 2 pages from the first two documents, and was created with a text editor and not with PDFBox.
And then added the following test code:
Class<?> clazz = Test.class;
String src1, src2, src3;
src1 = "/test-pdf-1.pdf";
src2 = "/test-pdf-2.pdf";
src3 = "/test-pdf-3.pdf";
InputStream is1, is2, is3;
is1 = clazz.getResourceAsStream(src1);
is2 = clazz.getResourceAsStream(src2);
String merged = mergePdfs(is1, is2);
PDDocument doc1, doc2, doc3, doc4;
is1 = clazz.getResourceAsStream(src1);
doc1 = PDDocument.load(is1);
is2 = clazz.getResourceAsStream(src2);
doc2 = PDDocument.load(is2);
is3 = clazz.getResourceAsStream(src3);
doc3 = PDDocument.load(is3);
doc4 = PDDocument.load(new File(merged));
System.out.println(src1 + " >\n\t" + getFontNames(doc1));
System.out.println(src2 + " >\n\t" + getFontNames(doc2));
System.out.println(src3 + " >\n\t" + getFontNames(doc3));
System.out.println(merged + " >\n\t" + getFontNames(doc4));
The output is as follows (I truncated the last file name for readability and easier comparison):
/test-pdf-1.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-2.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-3.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
C:\Temp\..9.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
You can see that both the file created by PDFBox's merge, "C:\temp\7193671804393899.pdf" (abbreviated in the output for readability), and the file "test-pdf-3.pdf", which was created with an editor, produce the same font output: each font is listed twice, once for each page.
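To make that repetition easier to spot in longer listings, the names can be collapsed into counts per base font. The sketch below is plain Java and assumes the input is a list of bare font names (without the `PDTrueTypeFont` class prefix shown above). The class `FontListingSummary` and its methods are hypothetical helpers, not part of PDFBox. It relies on the PDF convention that a subset font name starts with a tag of six uppercase letters followed by `+`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for summarising a font listing.
class FontListingSummary {

    // Subset tag: exactly six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // "BAAAAA+ArialMT" -> "ArialMT"; names without a tag are returned unchanged.
    static String stripSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    // Counts how often each base font name occurs, in first-seen order.
    static Map<String, Integer> countByBaseName(List<String> fontNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : fontNames) {
            counts.merge(stripSubsetTag(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input mirroring the merged-file listing above.
        List<String> listed = Arrays.asList(
                "BAAAAA+ArialMT", "CAAAAA+Roboto-Black",
                "BAAAAA+ArialMT", "CAAAAA+Roboto-Black");
        System.out.println(countByBaseName(listed)); // prints {ArialMT=2, Roboto-Black=2}
    }
}
```

A count above one only flags candidates: two entries with the same base name may still be different subsets or different font objects, so this is a reading aid rather than proof that anything can be removed.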
Opening the merged file in Acrobat Reader confirms that only one copy of the fonts exists.