PDFBox: How to modify page and save changes to a new file (e.g. remove link annotation)?

Tags: java, pdf, pdfbox

I need to remove link annotations from a PDF document. Here is the code template I have:

    public static void main(String[] args) throws IOException, COSVisitorException {
        try (PDDocument doc = PDDocument.load("input.pdf")) {
            final List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation ann : annotations) {
                    if (ann instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) ann;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            final PDActionURI linkUri = (PDActionURI) action;
                            if (linkUri.getURI().contains("www.example.com")) {
                                // TODO remove the link
                            }
                        }
                    }
                }
            }
            doc.save("output.pdf");
        }
    }

But I couldn't find a way to remove the links permanently and save the changes to a new file: the links are still there in the output.

How can I save page modifications?

asked Nov 09 '22 17:11 by andrew


1 Answer

I recently had a similar task. Maybe this answer will save someone some time.

In the code snippet below I used PDFBox 2.0.4.

You can remove any annotation from a document simply by removing it from the annotations list returned by page.getAnnotations(). The tricky part is that you cannot reliably do it by reference. For example, you could iterate over all annotations, collect those that should be removed, and then call annotations.removeAll(shouldBeRemoved). But this way there is no guarantee that the unwanted annotations will actually be removed from the document, because the annotation objects returned by page.getAnnotations() may not be exactly the same objects that are held in the page. A reliable way to remove annotations from the list is to remove them by index:

    List<PDAnnotation> annotations = page.getAnnotations();
    // Walk the list backwards so that removing by index never shifts
    // the elements that are still to be visited.
    for (int i = annotations.size() - 1; i >= 0; i--) {
        PDAnnotation annotation = annotations.get(i);
        if (annotation instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink) annotation;
            PDAction action = link.getAction();
            if (action instanceof PDActionURI) {
                PDActionURI uriAction = (PDActionURI) action;
                String uri = uriAction.getURI();
                if (uri != null && uri.contains("<some_text>")) {
                    annotations.remove(i); // remove(int index), not remove(Object)
                }
            }
        }
    }
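For completeness, here is a minimal sketch of how the index-based removal could be combined with loading and saving, which is what the question asks for. It assumes PDFBox 2.0.x on the classpath (PDDocument.load(File) and doc.getPages() are the 2.0 calls; the question's code uses the older 1.8 API). The class name LinkRemover, the helper names removeLinks and isMatchingLink, and the file names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

public class LinkRemover {

    // Hypothetical helper: true if the annotation is a link whose URI action contains `needle`.
    static boolean isMatchingLink(PDAnnotation annotation, String needle) {
        if (!(annotation instanceof PDAnnotationLink)) {
            return false;
        }
        PDAction action = ((PDAnnotationLink) annotation).getAction();
        if (!(action instanceof PDActionURI)) {
            return false;
        }
        String uri = ((PDActionURI) action).getURI();
        return uri != null && uri.contains(needle);
    }

    // Hypothetical helper: removes matching link annotations from every page, by index,
    // and returns how many were removed.
    static int removeLinks(PDDocument doc, String needle) throws IOException {
        int removed = 0;
        for (PDPage page : doc.getPages()) {
            List<PDAnnotation> annotations = page.getAnnotations();
            for (int i = annotations.size() - 1; i >= 0; i--) {
                if (isMatchingLink(annotations.get(i), needle)) {
                    annotations.remove(i);
                    removed++;
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) throws IOException {
        // File names are placeholders taken from the question.
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            int removed = removeLinks(doc, "www.example.com");
            doc.save(new File("output.pdf"));
            System.out.println("Removed " + removed + " link annotation(s)");
        }
    }
}
```

The document is saved inside the try block, before it is closed, so the output file reflects the modified annotation lists.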

P.S. As @mkl pointed out, removing the link annotations may not be enough. In that case you need to parse the page content and rewrite it, excluding the tokens related to the text that needs to be removed from the document.

answered Nov 14 '22 22:11 by briarheart