How to extract hyperlink information with PDFBox

I am trying to extract hyperlink information from a PDF using PDFBox, but I am unsure how to get it. Here is what I have so far:

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

I want to extract both the URL that each hyperlink points to and the text of the hyperlink. How can I do this?

Thanks

asked Jul 26 '16 10:07

kabeersvohra

2 Answers

Use this code, taken from the PrintURLs example in the PDFBox source code download:

// doc is an already opened PDDocument
int pageNum = 0;
for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

It works in two parts. Getting the URL is the easy part: it comes from the PDActionURI attached to the link annotation. Getting the link text takes more work: it is done by running a text extraction over the rectangle of the annotation.
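The coordinate adjustment in the rotation == 0 branch is the least obvious step, so here is a minimal standalone sketch of just that arithmetic. It assumes an unrotated page whose lower-left corner is at the origin. The class and method names (LinkRegions, toTextRegion) are hypothetical and not part of PDFBox; only java.awt.geom.Rectangle2D is a real API here.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration only; not part of PDFBox.
class LinkRegions {

    /**
     * Converts a link rectangle given in PDF coordinates (origin at the
     * bottom-left, y grows upwards) into a Java 2D rectangle (origin at the
     * top-left, y grows downwards). Assumes an unrotated page.
     */
    static Rectangle2D.Float toTextRegion(float lowerLeftX, float upperRightY,
                                          float width, float height,
                                          float pageHeight) {
        // The top edge of the link, measured from the top of the page,
        // becomes the y origin of the Java rectangle.
        float y = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, y, width, height);
    }

    public static void main(String[] args) {
        // A 100 x 12 link whose top edge sits at y = 720 on a page that is
        // 792 units high, i.e. 72 units below the top of the page.
        Rectangle2D.Float region = toTextRegion(50f, 720f, 100f, 12f, 792f);
        System.out.println(region.x + ", " + region.y + ", "
                + region.width + ", " + region.height);
    }
}
```

The resulting rectangle starts 72 units from the top of the page, which is the kind of value the answer's code passes to stripper.addRegion. Rotated pages would need a different mapping, which this sketch does not attempt.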

answered Oct 07 '22 18:10

Tilman Hausherr

If you need both external hyperlinks and internal links (for example, links that jump to another page), you can use the code below:

// originalPDF is an already opened PDDocument
int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    for (PDAnnotation annot : page.getAnnotations()) {
        if (!(annot instanceof PDAnnotationLink)) {
            continue;
        }
        PDAnnotationLink link = (PDAnnotationLink) annot;
        // the action holds either an external URL or an internal jump
        PDAction action = link.getAction();
        // some links have no action and store the destination directly
        PDDestination destination = link.getDestination();

        if (action instanceof PDActionURI) {
            // external link
            System.out.println("page " + pageNum + " uri link: " + ((PDActionURI) action).getURI());
            continue;
        }
        if (action instanceof PDActionGoTo) {
            // internal link: take the destination from the action
            destination = ((PDActionGoTo) action).getDestination();
        } else if (action != null) {
            // other action types are not handled here
            continue;
        }

        // resolve the destination to a page; unknown destination types are
        // skipped instead of aborting the rest of the page's annotations
        PDPageDestination pageDestination = null;
        if (destination instanceof PDPageDestination) {
            pageDestination = (PDPageDestination) destination;
        } else if (destination instanceof PDNamedDestination) {
            pageDestination = originalPDF.getDocumentCatalog()
                    .findNamedDestinationPage((PDNamedDestination) destination);
        }

        if (pageDestination != null) {
            // retrievePageNumber() is zero-based, so add 1 for display
            System.out.println("page " + pageNum + " page move: " + (pageDestination.retrievePageNumber() + 1));
        }
    }
}

answered Oct 07 '22 16:10

Adam

Tags: java, text, hyperlink, pdf, pdfbox
