PDFBox: Merge adds unused fonts, how to remove them

Tags:

java

pdfbox

I merge two PDF files into one with PDFBox version 2. The first one has these fonts:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XXMGEM+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes     15  0
XXMGEM+ArialMT                       TrueType          WinAnsi          yes yes yes     19  0
XXMGEM+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
XXMGEM+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes     40  0
XXMGEM+ArialNarrow                   TrueType          WinAnsi          yes yes yes     44  0

and the second one has these:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
UNTWVR+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     25  0
UNTYID+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     26  0
UNTZUP+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
UNUBHB+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     28  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no      29  0
UNXPUH+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     50  0
UNXRGT+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     51  0
UNXSTF+ArialMT                       CID TrueType      Identity-H       yes yes yes     52  0
UNXUFR+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     53  0

After merging, the result contains these fonts:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    420  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    421  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    422  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    423  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no     424  0
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    425  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    426  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    427  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    428  0
SRWYVL+ArialMT                       CID TrueType      Identity-H       yes yes yes    429  0
SRXAHX+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    430  0
SRXBUJ+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    431  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    432  0
WDEGAT+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes    436  0
GSEDXU+ArialMT                       TrueType          WinAnsi          yes yes yes    437  0
Arial                                TrueType          WinAnsi          yes no  no     416  0
ZapfDingbats                         TrueType          WinAnsi          yes no  yes    419  0
ArialNarrow                          TrueType          WinAnsi          yes no  no     417  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    618  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    619  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    620  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    621  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    622  0
GSEDXU+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes    560  0
NVGLHQ+ArialNarrow                   TrueType          WinAnsi          yes yes yes    561  0
KWHHMM+ArialMT                       CID TrueType      Identity-H       yes yes yes    578  0

My code in Java:

final PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.setDestinationStream(outputStream);
pdfMerger.addSources(additionalPdfStreams);
pdfMerger.addSource(inputStreamPdDocument);
pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

The problem is that an API from a third-party vendor cannot handle these fonts. So: what am I doing wrong, and how can I remove the unused and duplicated fonts?

asked Nov 15 '18 09:11

Skary

1 Answer

The "duplication" most likely comes from the document having multiple pages: each page carries its own resource dictionary listing the fonts it uses. If you iterate over the pages and collect the font names, a font used on more than one page shows up once per page in the output.

Something seems very wrong with the details in the question, though. Neither of the source files has a ZapfDingbats font, so how did it end up in the merged document?

First, I wrote a couple of helper methods:

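import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Merges two PDFs into a file in the temp directory and returns its path.
static String mergePdfs(InputStream is1, InputStream is2) throws IOException {
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSource(is1);
    pdfMerger.addSource(is2);

    // java.io.tmpdir does not always end with a separator, so let File join the parts.
    String destFile = new File(System.getProperty("java.io.tmpdir"), System.nanoTime() + ".pdf").getPath();
    pdfMerger.setDestinationFileName(destFile);
    pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

    return destFile;
}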

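import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

// Lists the fonts referenced by each page's resources, page by page.
// A font used on several pages therefore appears once per page.
static List<String> getFontNames(PDDocument doc) throws IOException {
    List<String> result = new ArrayList<>();
    for (PDPage page : doc.getPages()) {
        PDResources res = page.getResources();
        if (res == null) continue; // page without a resource dictionary
        for (COSName fontName : res.getFontNames()) {
            result.add(String.valueOf(res.getFont(fontName)));
        }
    }

    return result;
}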
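To illustrate why a per-page listing repeats fonts, here is a small sketch that does not use PDFBox at all. The pages are modelled as plain maps from resource name to a font object; these stand-in types and the class name SharedFontDemo are made up for the example. With PDFBox, the object to compare would presumably be the dictionary returned by PDFont.getCOSObject(), but that is an assumption and not something tested here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in model, not PDFBox classes: a "page" is a map from
// resource name to a font object, the way page resources reference fonts.
class SharedFontDemo {

    // Per-page listing: one entry per (page, font) pair, like getFontNames.
    static List<Object> listPerPage(List<Map<String, Object>> pages) {
        List<Object> result = new ArrayList<>();
        for (Map<String, Object> page : pages) {
            result.addAll(page.values());
        }
        return result;
    }

    // Distinct listing: each font object counts once, compared by identity.
    static Set<Object> distinctByIdentity(List<Map<String, Object>> pages) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Map<String, Object> page : pages) {
            seen.addAll(page.values());
        }
        return seen;
    }

    // Builds two pages that both reference the same two font objects.
    static List<Map<String, Object>> twoPagesSharingFonts() {
        Object arial = new Object();
        Object roboto = new Object();
        List<Map<String, Object>> pages = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Map<String, Object> page = new LinkedHashMap<>();
            page.put("F1", arial);
            page.put("F2", roboto);
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> pages = twoPagesSharingFonts();
        System.out.println("per page: " + listPerPage(pages).size());        // per page: 4
        System.out.println("distinct: " + distinctByIdentity(pages).size()); // distinct: 2
    }
}
```

Identity comparison only collapses references to the very same object. Two separate font objects that happen to share a name would still be counted separately.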

Then I created three test PDF documents. The first two, test-pdf-1.pdf and test-pdf-2.pdf, contain one page each and use the same two fonts: PDTrueTypeFont BAAAAA+ArialMT and PDTrueTypeFont CAAAAA+Roboto-Black. The third, test-pdf-3.pdf, contains the two pages from the first two documents and was created with a text editor rather than with PDFBox.

Then I added the following test code:

Class<?> clazz = Test.class;
String src1, src2, src3;
src1 = "/test-pdf-1.pdf";
src2 = "/test-pdf-2.pdf";
src3 = "/test-pdf-3.pdf";

InputStream is1, is2, is3;
is1 = clazz.getResourceAsStream(src1);
is2 = clazz.getResourceAsStream(src2);

String merged = mergePdfs(is1, is2);

PDDocument doc1, doc2, doc3, doc4;

is1 = clazz.getResourceAsStream(src1);
doc1 = PDDocument.load(is1);

is2 = clazz.getResourceAsStream(src2);
doc2 = PDDocument.load(is2);

is3 = clazz.getResourceAsStream(src3);
doc3 = PDDocument.load(is3);

doc4 = PDDocument.load(new File(merged));

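System.out.println(src1 + " >\n\t" + getFontNames(doc1));
System.out.println(src2 + " >\n\t" + getFontNames(doc2));
System.out.println(src3 + " >\n\t" + getFontNames(doc3));
System.out.println(merged + " >\n\t" + getFontNames(doc4));

// Release the documents once the font names have been printed.
for (PDDocument d : new PDDocument[] {doc1, doc2, doc3, doc4}) d.close();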

The output is as follows (I truncated the last file name for readability and easier comparison):

/test-pdf-1.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-2.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-3.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
C:\Temp\..9.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]

You can see that the file created by PDFBox's merge, "C:\temp\7193671804393899.pdf" (abbreviated in the output for readability), and the file "test-pdf-3.pdf", which was created with an editor, produce the same font listing: each font appears twice, once for each page.
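Names such as BAAAAA+ArialMT carry a subset tag: six uppercase letters followed by a plus sign. As a rough sketch, a per-page listing of such names can be collapsed into distinct base names with plain string handling. The class and method names below are made up for the example, and the input is the list of font names only, without the PDTrueTypeFont prefix that toString() adds.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class FontNameDedup {
    // Subset tag: exactly six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Returns the font name without its subset tag; untagged names pass through.
    static String stripSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    // Distinct base names, in first-seen order.
    static Set<String> distinctBaseNames(List<String> fontNames) {
        Set<String> result = new LinkedHashSet<>();
        for (String name : fontNames) {
            result.add(stripSubsetTag(name));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> perPage = Arrays.asList(
                "BAAAAA+ArialMT", "CAAAAA+Roboto-Black",
                "BAAAAA+ArialMT", "CAAAAA+Roboto-Black");
        System.out.println(distinctBaseNames(perPage)); // prints [ArialMT, Roboto-Black]
    }
}
```

This only compares names. Two subsets of the same typeface with different tags collapse to one base name here, even though they may be separate embedded font programs in the file.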

Opening the merged file in Acrobat Reader confirms that only one copy of the fonts exists:

[Screenshot: C:\temp\7193671804393899.pdf Properties > Fonts]

answered Sep 26 '22 14:09

isapir
