I am trying to extract the hyperlink information from a PDF using PDFBox, but I am unsure how to get the URL and the link text. This is what I have so far:
for( Object p : pages ) {
    PDPage page = (PDPage)p;
    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;
        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}
I want to extract the URL of the hyperlink destination and the text of the hyperlink. How can I do this?
Thanks
Use this code, taken from the PrintURLs example in the PDFBox source code download:
for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }
    stripper.extractRegions( page );
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}
It works in two parts. Getting the URL is the easy part: it comes from the link's PDActionURI. Getting the link text is done by running a text extraction restricted to the rectangle of the annotation.
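The one non-obvious step in that code is the repositioning of the link rectangle. PDF page coordinates have their origin at the bottom-left with y growing upward, while the area stripper takes Java-style rectangles with the origin at the top-left. Here is a minimal sketch of that flip for an unrotated page, using only java.awt.geom. The class LinkRegion, the method toTextSpace and the sample numbers are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: mirrors the rotation == 0 branch above.
class LinkRegion {
    /**
     * Converts an annotation rectangle from PDF coordinates (origin bottom-left,
     * y grows upward) to Java 2D coordinates (origin top-left, y grows downward).
     * Only valid for pages with rotation 0.
     */
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height, float pageHeight) {
        // The top edge of the link, measured down from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Assumed values: a 792 pt tall page with a 100 x 20 pt link near the top.
        Rectangle2D.Float r = toTextSpace(72f, 720f, 100f, 20f, 792f);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
        // prints "72.0, 72.0, 100.0, 20.0"
    }
}
```

In the answer's code the same inputs come from link.getRectangle() and page.getMediaBox(). Rotated pages would need a different mapping, which this sketch does not attempt.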
We needed both the hyperlink information and the internal links (e.g. jumps to another page). I am using the code below:
int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    List<PDAnnotation> annotations = page.getAnnotations();
    for (PDAnnotation annot : annotations) {
        if (!(annot instanceof PDAnnotationLink)) {
            continue;
        }
        PDAnnotationLink link = (PDAnnotationLink) annot;
        // the action holds either an external URI or an internal GoTo target
        PDAction action = link.getAction();
        // special case: some links have no action and carry the destination directly
        PDDestination destination = link.getDestination();
        if (action instanceof PDActionURI) {
            // external link
            System.out.println("uri link:" + ((PDActionURI) action).getURI());
            continue;
        }
        if (action instanceof PDActionGoTo) {
            // internal link stored in a GoTo action
            destination = ((PDActionGoTo) action).getDestination();
        } else if (action != null) {
            // other action types are not handled here
            continue;
        }
        // resolve the destination to a page; unknown destination types are skipped
        // (the earlier version used break here, which also dropped the remaining links on the page)
        PDPageDestination pageDestination = null;
        if (destination instanceof PDPageDestination) {
            pageDestination = (PDPageDestination) destination;
        } else if (destination instanceof PDNamedDestination) {
            pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) destination);
        }
        if (pageDestination != null) {
            // retrievePageNumber() is 0-based, so add 1 for display
            System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
        }
    }
}