PDFBox

Question

I need to check whether the tags in a PDF have the properties required by accessibility guidelines. Examples:

H1 - validate that an H1 exists in the PDF
Image (Figure tag) - validate that the image/figure has alt text
Language - validate that the language property is set so that a screen reader reads the document properly. For Spanish and English documents, the respective language codes should be set
Tables - access the table object and validate that the table structure is correct (header columns match the row columns, etc.)
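To make the Language and Tables checks concrete, here is a rough, PDFBox-independent sketch of the rules themselves. Everything in it is an assumption for illustration: the class and method names are hypothetical, the language value is assumed to be a BCP 47 tag such as "es-MX", the table is assumed to arrive as a list of rows of cell tag names (TH/TD), and ColSpan/RowSpan attributes are ignored.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: plain-Java versions of two of the checks listed above.
// It does not read a PDF; the inputs are assumed to come from the tag tree.
class TagChecks {

    // True if langTag (e.g. "es-MX") is set and its primary language
    // subtag equals expectedLanguage (e.g. "es").
    static boolean languageMatches(String langTag, String expectedLanguage) {
        if (langTag == null || langTag.trim().isEmpty()) {
            return false; // language property is missing
        }
        // Ill-formed tags yield an empty language, which is rejected below.
        String primary = Locale.forLanguageTag(langTag.trim()).getLanguage();
        return !primary.isEmpty()
                && primary.equals(expectedLanguage.toLowerCase(Locale.ROOT));
    }

    // True if the first row consists only of TH cells and every row has
    // the same number of cells as that header row (spans are not considered).
    static boolean headerMatchesRows(List<List<String>> rows) {
        if (rows.isEmpty() || rows.get(0).isEmpty()) {
            return false;
        }
        List<String> header = rows.get(0);
        for (String cell : header) {
            if (!"TH".equals(cell)) {
                return false;
            }
        }
        for (List<String> row : rows) {
            if (row.size() != header.size()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(languageMatches("es-MX", "es"));
        System.out.println(headerMatchesRows(
                List.of(List.of("TH", "TH"), List.of("TD", "TD"))));
    }
}
```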

So far I was able to:

Extract the metadata and validate that the document has proper Title, Subject and Producer info using PDDocument.getDocumentInformation().getMetadataKeys()
Validate whether the PDF is accessible (tagged) by checking the PDDocument.getDocumentCatalog().getMarkInfo().isMarked() flag

To access the Tags, I have tried these options:

getDocumentCatalog().getAcroForm() returns null
PDDocument.getDocumentCatalog().getPages().get(0).getAnnotations() returns null
Looping through PDDocument.getDocumentCatalog().getStructureTreeRoot().getKids() returns only one object, of type StructElem

The accessible PDF is created with OpenText, so the dev team doesn't know PDFBox. I am lost as to how to get access to the tags/objects (using MarkedContent or something else).

Please suggest how to extract the individual objects (tags) such as P, H1, Table and Figure/Image and validate their properties. Note: these properties are currently validated manually using Adobe Acrobat Pro.

Monte Chan · Accepted Answer

Based upon https://issues.apache.org/jira/browse/PDFBOX-7, it appears that you can use PDFMarkedContentExtractor to get the information that you need.
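As a complement, the tag properties themselves (structure type, alt text, language) sit on the structure elements rather than in the marked content, so a recursive walk over the structure tree may be what is needed here. The structure tree root often has a single top-level element (commonly Document), which would explain why getKids() on the root returned only one StructElem. The following is a minimal sketch, assuming the PDFBox logical structure classes; the class name StructureTreeWalker and its helper methods are hypothetical, and it compares the raw structure type without consulting the role map.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureNode;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

// Hypothetical helper class; only the PDFBox types and getters are real API.
class StructureTreeWalker {

    // Collects every structure element below the root, depth first.
    static List<PDStructureElement> collectElements(PDDocument document) {
        List<PDStructureElement> found = new ArrayList<>();
        PDStructureTreeRoot root = document.getDocumentCatalog().getStructureTreeRoot();
        if (root != null) { // untagged PDFs have no structure tree
            walk(root, found);
        }
        return found;
    }

    // Kids can also be marked-content or object references; only
    // structure elements have further kids, so only those are descended.
    private static void walk(PDStructureNode node, List<PDStructureElement> found) {
        for (Object kid : node.getKids()) {
            if (kid instanceof PDStructureElement) {
                PDStructureElement element = (PDStructureElement) kid;
                found.add(element);
                walk(element, found);
            }
        }
    }

    // True if at least one element is tagged H1.
    static boolean hasH1(List<PDStructureElement> elements) {
        for (PDStructureElement element : elements) {
            if ("H1".equals(element.getStructureType())) {
                return true;
            }
        }
        return false;
    }

    // Returns the Figure elements whose Alt entry is missing or blank.
    static List<PDStructureElement> figuresWithoutAlt(List<PDStructureElement> elements) {
        List<PDStructureElement> missing = new ArrayList<>();
        for (PDStructureElement element : elements) {
            if ("Figure".equals(element.getStructureType())) {
                String alt = element.getAlternateDescription();
                if (alt == null || alt.trim().isEmpty()) {
                    missing.add(element);
                }
            }
        }
        return missing;
    }

    // Document-level language from the catalog; null if it is not set.
    static String documentLanguage(PDDocument document) {
        return document.getDocumentCatalog().getLanguage();
    }
}
```

With PDFBox 2.x the document would be opened with PDDocument.load(file), with 3.x via Loader.loadPDF(file), and then passed to collectElements. Table, TR, TH and TD elements come back from the same walk, so a table check can inspect the kids of each Table element, and an element-level language override can be read with getLanguage() on the element.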

PDFBox - Accessible PDF - How to check if PDF Tags have properties as per Accessibility guidelines

Tags: java, pdf, accessibility

Sachin G
