
How to extract hyperlink information PDFBox

I am trying to extract the hyperlink information from a PDF using PDFBox, but I am unsure how to get it. This is what I have so far:

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

I want to extract the URL of the hyperlink destination and the text of the hyperlink. How can I do this?

Thanks

kabeersvohra asked Jul 26 '16 10:07



2 Answers

Use this code, taken from the PrintURLs example in the source code download:

for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

It works in two parts. Getting the URL is easy: it comes from the link's PDActionURI. Getting the link text takes a text extraction over the rectangle of the annotation.
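
To illustrate the repositioning step for an unrotated page, here is a minimal sketch that uses only the JDK, not PDFBox. The class name LinkRegionSketch, the method toTextSpace, and the sample numbers are made up for illustration. It assumes the media box starts at the origin and the page rotation is 0.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: maps a link rectangle from PDF space to top-left-origin space. */
final class LinkRegionSketch {

    /**
     * @param lowerLeftX  left edge of the annotation rectangle
     * @param upperRightY top edge of the annotation rectangle, measured up from the page bottom
     * @param width       width of the annotation rectangle
     * @param height      height of the annotation rectangle
     * @param pageHeight  height of the page's media box
     */
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height, float pageHeight) {
        // PDF measures y upward from the bottom edge of the page,
        // while java.awt rectangles measure y downward from the top edge.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Sample values: an 842-unit-high page and a link whose top edge
        // sits 100 units below the top of the page.
        Rectangle2D.Float region = toTextSpace(72f, 742f, 200f, 12f, 842f);
        System.out.println(region.x + " " + region.y + " " + region.width + " " + region.height);
    }
}
```

The resulting rectangle is the kind of value the answer's code passes to stripper.addRegion. Rotated pages would need a different mapping, which this sketch does not attempt.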

Tilman Hausherr answered Oct 07 '22 18:10


To get both external hyperlinks and internal links (e.g. jumps to another page), I use the code below:

int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    for (PDAnnotation annot : page.getAnnotations()) {
        if (!(annot instanceof PDAnnotationLink)) {
            continue;
        }
        PDAnnotationLink link = (PDAnnotationLink) annot;
        // The action holds either an external URL or an internal jump
        PDAction action = link.getAction();
        // Some links have no action and store the destination directly
        PDDestination destination = link.getDestination();

        if (action instanceof PDActionURI) {
            // External link
            System.out.println("uri link:" + ((PDActionURI) action).getURI());
            continue;
        }
        if (action instanceof PDActionGoTo) {
            // Internal link expressed as a GoTo action
            destination = ((PDActionGoTo) action).getDestination();
        } else if (action != null) {
            // Other action types are not handled here
            continue;
        }

        PDPageDestination pageDestination = null;
        if (destination instanceof PDPageDestination) {
            pageDestination = (PDPageDestination) destination;
        } else if (destination instanceof PDNamedDestination) {
            // Named destinations are resolved through the document catalog
            pageDestination = originalPDF.getDocumentCatalog()
                    .findNamedDestinationPage((PDNamedDestination) destination);
        }
        // Unsupported destination types leave pageDestination null and are skipped

        if (pageDestination != null) {
            // retrievePageNumber() is zero-based, so add 1 for display
            System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
        }
    }
}

Adam answered Oct 07 '22 16:10