
PDFBox: merge adds unused fonts, how to remove them

Tags: java, pdfbox

I merge two PDF files into one with PDFBox version 2. The first one has these fonts:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XXMGEM+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes     15  0
XXMGEM+ArialMT                       TrueType          WinAnsi          yes yes yes     19  0
XXMGEM+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
XXMGEM+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes     40  0
XXMGEM+ArialNarrow                   TrueType          WinAnsi          yes yes yes     44  0

The second one has these:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
UNTWVR+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     25  0
UNTYID+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     26  0
UNTZUP+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
UNUBHB+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     28  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no      29  0
UNXPUH+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     50  0
UNXRGT+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     51  0
UNXSTF+ArialMT                       CID TrueType      Identity-H       yes yes yes     52  0
UNXUFR+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     53  0

After merging, the result contains:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    420  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    421  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    422  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    423  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no     424  0
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    425  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    426  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    427  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    428  0
SRWYVL+ArialMT                       CID TrueType      Identity-H       yes yes yes    429  0
SRXAHX+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    430  0
SRXBUJ+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    431  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    432  0
WDEGAT+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes    436  0
GSEDXU+ArialMT                       TrueType          WinAnsi          yes yes yes    437  0
Arial                                TrueType          WinAnsi          yes no  no     416  0
ZapfDingbats                         TrueType          WinAnsi          yes no  yes    419  0
ArialNarrow                          TrueType          WinAnsi          yes no  no     417  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    618  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    619  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    620  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    621  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    622  0
GSEDXU+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes    560  0
NVGLHQ+ArialNarrow                   TrueType          WinAnsi          yes yes yes    561  0
KWHHMM+ArialMT                       CID TrueType      Identity-H       yes yes yes    578  0

My Java code:

final PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.setDestinationStream(outputStream);
pdfMerger.addSources(additionalPdfStreams);
pdfMerger.addSource(inputStreamPdDocument);
pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

The problem is that a third-party vendor's API has trouble with these fonts. So: what am I doing wrong, and how can I remove the unused and duplicated fonts?

asked Nov 15 '18 09:11 by Skary




1 Answer

The "duplication" most likely comes from the document having multiple pages: each page has its own resources, which list the fonts that page uses. If you iterate over the pages and collect the font names, a font used on more than one page shows up once per page in the output.

Something seems off about the details in the question, though. Neither of the source files contains a ZapfDingbats font, so how did it end up in the merged document?
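To check that observation against the listings in the question, the font names can be compared as plain strings. The sketch below is only an illustration: `FontDiff` and its methods are hypothetical names (not part of PDFBox), the lists are abridged copies of the names from the three tables in the question, and it relies on the PDF convention that a subset font name starts with a six-letter uppercase tag followed by `+`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: compares font names as plain strings.
class FontDiff {

    // Names copied from the question's listings (duplicates left out).
    static final List<String> FIRST = Arrays.asList(
            "XXMGEM+Arial-BoldMT", "XXMGEM+ArialMT",
            "XXMGEM+ArialNarrow-Bold", "XXMGEM+ArialNarrow");
    static final List<String> SECOND = Arrays.asList(
            "UNTWVR+HelveticaLTCom-Roman", "UNTYID+HelveticaLTCom-Bold",
            "UNTZUP+ArialMT", "UNUBHB+Arial-BoldMT", "Helvetica-Bold");
    static final List<String> MERGED = Arrays.asList(
            "SRWYVL+HelveticaLTCom-Roman", "SRXAHX+HelveticaLTCom-Bold",
            "SRXBUJ+ArialMT", "SRXDGV+Arial-BoldMT", "Helvetica-Bold",
            "Arial", "ZapfDingbats", "ArialNarrow", "ACHRDX+ZapfDingbats",
            "GSEDXU+ArialNarrow-Bold", "NVGLHQ+ArialNarrow");

    // Strips a subset tag such as "ACHRDX+" (six uppercase letters and '+').
    static String baseName(String fontName) {
        return fontName.matches("[A-Z]{6}\\+.+") ? fontName.substring(7) : fontName;
    }

    // Sorted set of base names, so the printed result is stable.
    static Set<String> baseNames(List<String> fontNames) {
        Set<String> result = new TreeSet<>();
        for (String name : fontNames) {
            result.add(baseName(name));
        }
        return result;
    }

    // Base names that occur in the merged listing but in none of the sources.
    static Set<String> onlyInMerged(List<String> merged, List<List<String>> sources) {
        Set<String> result = baseNames(merged);
        for (List<String> source : sources) {
            result.removeAll(baseNames(source));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Arial, ZapfDingbats]
        System.out.println(onlyInMerged(MERGED, Arrays.asList(FIRST, SECOND)));
    }
}
```

Besides ZapfDingbats, this also flags a plain Arial entry that appears in neither source listing. A name-based comparison like this says nothing about which file contributed those fonts; it only confirms that the two listed sources do not account for them.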

First, I wrote a couple of helper methods:

static String mergePdfs(InputStream is1, InputStream is2) throws IOException {
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSource(is1);
    pdfMerger.addSource(is2);

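    // Build the path with File so it is valid even when java.io.tmpdir has no trailing separator.
    String destFile = new File(System.getProperty("java.io.tmpdir"), System.nanoTime() + ".pdf").getPath();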
    pdfMerger.setDestinationFileName(destFile);
    pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

    return destFile;
}

static List<String> getFontNames(PDDocument doc) throws IOException {
    List<String> result = new ArrayList<>();
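    for (PDPage page : doc.getPages()) {
        PDResources res = page.getResources();
        if (res == null) {
            continue; // this page has no resources to inspect
        }
        // Fonts are listed per page, so a font used on several pages is added once per page.
        for (COSName fontName : res.getFontNames()) {
            result.add(res.getFont(fontName).toString());
        }
    }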

    return result;
}

Then I created three test PDF documents. The first two, test-pdf-1.pdf and test-pdf-2.pdf, contain one page each and use the same two fonts: PDTrueTypeFont BAAAAA+ArialMT and PDTrueTypeFont CAAAAA+Roboto-Black. The third, test-pdf-3.pdf, contains the two pages from the first two documents and was created with a text editor, not with PDFBox.

Then I added the following test code:

Class<?> clazz = Test.class;
String src1, src2, src3;
src1 = "/test-pdf-1.pdf";
src2 = "/test-pdf-2.pdf";
src3 = "/test-pdf-3.pdf";

InputStream is1, is2, is3;
is1 = clazz.getResourceAsStream(src1);
is2 = clazz.getResourceAsStream(src2);

String merged = mergePdfs(is1, is2);

PDDocument doc1, doc2, doc3, doc4;

is1 = clazz.getResourceAsStream(src1);
doc1 = PDDocument.load(is1);

is2 = clazz.getResourceAsStream(src2);
doc2 = PDDocument.load(is2);

is3 = clazz.getResourceAsStream(src3);
doc3 = PDDocument.load(is3);

doc4 = PDDocument.load(new File(merged));

System.out.println(src1 + " >\n\t" + getFontNames(doc1));
System.out.println(src2 + " >\n\t" + getFontNames(doc2));
System.out.println(src3 + " >\n\t" + getFontNames(doc3));
System.out.println(merged + " >\n\t" + getFontNames(doc4));

// Release the documents once the names have been printed.
doc1.close();
doc2.close();
doc3.close();
doc4.close();

The output is as follows (I truncated the last file name for readability and easier comparison):

/test-pdf-1.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-2.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-3.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
C:\Temp\..9.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]

You can see that the file created by PDFBox's merge, "C:\temp\7193671804393899.pdf" (abbreviated in the output for readability), and the file "test-pdf-3.pdf", which was created with an editor, produce the same font output: each font is listed twice, once per page.
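The per-page repetition becomes easier to read when the names are collapsed into distinct entries with a count. The following is a minimal sketch using only the standard library; `FontNameCounts` and `countByName` are hypothetical names, and the input is the list printed for the merged file above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: summarizes a per-page font listing by name.
class FontNameCounts {

    // Counts how often each name occurs, keeping first-seen order.
    static Map<String, Integer> countByName(List<String> perPageNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : perPageNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names printed for the merged two-page document.
        List<String> merged = Arrays.asList(
                "PDTrueTypeFont BAAAAA+ArialMT", "PDTrueTypeFont CAAAAA+Roboto-Black",
                "PDTrueTypeFont BAAAAA+ArialMT", "PDTrueTypeFont CAAAAA+Roboto-Black");
        // prints {PDTrueTypeFont BAAAAA+ArialMT=2, PDTrueTypeFont CAAAAA+Roboto-Black=2}
        System.out.println(countByName(merged));
    }
}
```

Equal names do not prove that the pages share one embedded font object, so this only summarizes the listing; it is the Acrobat font dialog below that shows a single copy.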

Opening the merged file in Acrobat Reader confirms that only one copy of the fonts exists:

[Screenshot: C:\temp\7193671804393899.pdf Properties > Fonts]

answered Sep 26 '22 14:09 by isapir