
PDFBox - Accessible PDF - How to check if PDF tags have properties as per accessibility guidelines

I need to check whether PDF tags have the properties required by accessibility guidelines. Examples:

  • H1 - validate that an H1 exists in the PDF
  • Image (Figure tag) - validate that the image/figure has Alt text
  • Language - validate that the language property is set so that a screen reader reads the document properly. For Spanish and English documents, the respective language codes should be set
  • Tables - access the table object and validate that the table structure is proper (header columns match the row columns, etc.)

So far I was able to:

  • Extract the metadata and validate that the document has a proper Title, Subject and Producer via PDDocument.getDocumentInformation().getMetadataKeys()
  • Validate whether the PDF is accessible (tagged) by checking the PDDocument.getDocumentCatalog().getMarkInfo().isMarked() flag

To access the Tags, I have tried these options:

  • getDocumentCatalog().getAcroForm() returns null
  • PDDocument.getDocumentCatalog().getPages().get(0).getAnnotations() returns null
  • I tried looping through PDDocument.getDocumentCatalog().getStructureTreeRoot().getKids(), but it returns only one object of type StructElem
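
A single StructElem under the structure tree root is not necessarily a dead end: tagged PDFs often hang every other tag beneath one top-level element (for example a Document tag), so the remaining tags only show up when the kids are walked recursively. Below is a minimal sketch of such a depth-first walk. `Tag` and `TagWalker` are hypothetical stand-ins so that the example runs on a bare JDK; they are not PDFBox classes. In PDFBox, getKids() returns a List<Object> whose entries are not all structure elements, so a real walk would presumably recurse only into entries that are instances of PDStructureElement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a PDF structure element (not a PDFBox class).
final class Tag {
    final String type;     // structure type, e.g. "Document", "H1", "P"
    final List<Tag> kids;  // nested structure elements

    Tag(String type, Tag... kids) {
        this.type = type;
        this.kids = Arrays.asList(kids);
    }
}

final class TagWalker {
    // Depth-first walk that records every tag type, not just the top level.
    static List<String> collectTypes(Tag node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Tag node, List<String> out) {
        out.add(node.type);
        for (Tag kid : node.kids) {
            collect(kid, out);
        }
    }
}
```

Calling TagWalker.collectTypes on a tree whose only top-level element is Document still lists the H1, P, and Figure tags nested inside it, which is the part a single-level loop over getKids() misses.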

The accessible PDFs are created using OpenText, so the dev team doesn't know PDFBox. I am lost as to how to get access to the tags/objects (use MarkedContent or something else?).

Please suggest how to extract the individual objects (tags) such as P, H1, Table, Figure/Image and validate their properties. Note: manual validation of these properties is performed using Adobe Acrobat Pro.

Sachin G asked Oct 27 '22 22:10



1 Answer

Based upon https://issues.apache.org/jira/browse/PDFBOX-7, it appears that you can use PDFMarkedContentExtractor to get the information that you need.
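
Once the tags are available, the checks listed in the question reduce to simple rules over the tag tree. The following is only a sketch under stated assumptions, not a PDFBox implementation: `TagNode` and `TagChecks` are hypothetical names, and the model is filled by hand so that the example runs without PDFBox. In PDFBox the type and Alt values would presumably come from PDStructureElement.getStructureType() and PDStructureElement.getAlternateDescription(), and the language from PDDocumentCatalog.getLanguage() or PDStructureElement.getLanguage(). The table rule is deliberately simplified: it only compares the number of TH/TD cells per TR and ignores ColSpan, RowSpan, and THead/TBody wrappers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one tag (not a PDFBox class).
final class TagNode {
    final String type;  // e.g. "H1", "Figure", "Table", "TR", "TH", "TD"
    final String alt;   // alternate description (Alt), or null when absent
    final List<TagNode> kids;

    TagNode(String type, String alt, TagNode... kids) {
        this.type = type;
        this.alt = alt;
        this.kids = Arrays.asList(kids);
    }
}

final class TagChecks {
    // Returns one message per problem; an empty list means every check passed.
    static List<String> validate(TagNode root) {
        List<String> problems = new ArrayList<>();
        if (!containsType(root, "H1")) {
            problems.add("No H1 tag found");
        }
        visit(root, problems);
        return problems;
    }

    private static boolean containsType(TagNode node, String type) {
        if (type.equals(node.type)) {
            return true;
        }
        for (TagNode kid : node.kids) {
            if (containsType(kid, type)) {
                return true;
            }
        }
        return false;
    }

    private static void visit(TagNode node, List<String> problems) {
        if ("Figure".equals(node.type) && (node.alt == null || node.alt.trim().isEmpty())) {
            problems.add("Figure without Alt text");
        }
        if ("Table".equals(node.type)) {
            checkTable(node, problems);
        }
        for (TagNode kid : node.kids) {
            visit(kid, problems);
        }
    }

    // Simplified rule: every TR directly under the table must have as many
    // TH/TD cells as the first TR.
    private static void checkTable(TagNode table, List<String> problems) {
        int expected = -1;
        for (TagNode row : table.kids) {
            if (!"TR".equals(row.type)) {
                continue;
            }
            int cells = 0;
            for (TagNode cell : row.kids) {
                if ("TH".equals(cell.type) || "TD".equals(cell.type)) {
                    cells++;
                }
            }
            if (expected < 0) {
                expected = cells;
            } else if (cells != expected) {
                problems.add("Table row has " + cells + " cells, expected " + expected);
            }
        }
    }
}
```

Returning a list of messages rather than a boolean keeps the result close to a manual review in Acrobat, where each failed property is reported separately.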

Monte Chan answered Nov 02 '22 09:11
