
PDFBox: how to fetch the fields of all kinds of PDF forms

Tags: java, pdf, pdfbox, xfa

I am able to fetch the field names for most PDF files using PDFBox, but I am not able to fetch the fields of an income tax form. Is something restricted in that form?

Although the form contains multiple fields, only one field is shown.

This is the output:

topmostSubform[0].

My code:

PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List fields = acroForm.getFields();

@SuppressWarnings("rawtypes")
java.util.Iterator fieldsIter = fields.iterator();
System.out.println(fields.size());
while( fieldsIter.hasNext())
{
    PDField field = (PDField)fieldsIter.next();
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
}

It is called from:

public static void main(String[] args) throws IOException {
    // main already declares IOException, so a failed load stops here
    // instead of continuing with a null document.
    PDDocument pdDoc = PDDocument.load("income.pdf");
    try {
        Ggdfgdgdgf fields = new Ggdfgdgdgf();
        fields.printFields(pdDoc);
    } finally {
        // Release the file once the fields have been printed.
        pdDoc.close();
    }
}
Baswa Prasad asked Dec 24 '22 08:12



1 Answer

The PDF in question is a hybrid AcroForm/XFA form. This means that it contains the form definition both in AcroForm and in XFA format.

PDFBox primarily supports AcroForm (which is the PDF form technology presented in the PDF specification), but as both formats are present, PDFBox can at least inspect the AcroForm form definition.

Your code overlooks that PDAcroForm.getFields() does not return all field definitions but merely the definitions of the root fields, cf. the JavaDoc comments:

/**
 * This will return all of the documents root fields.
 * 
 * A field might have children that are fields (non-terminal field) or does not
 * have children which are fields (terminal fields).
 * 
 * The fields within an AcroForm are organized in a tree structure. The documents root fields 
 * might either be terminal fields, non-terminal fields or a mixture of both. Non-terminal fields
 * mark branches which contents can be retrieved using {@link PDNonTerminalField#getChildren()}.
 * 
 * @return A list of the documents root fields.
 * 
 */
public List<PDField> getFields()
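To see why a root-only listing prints a single name, the tree structure described in that comment can be modelled without PDFBox at all. The following is a minimal sketch under stated assumptions: the Node class is a hypothetical stand-in (not a PDFBox type), and the child names Page1[0], f1_01[0] and f1_02[0] are made up for illustration. Only topmostSubform[0] comes from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a form field; not a PDFBox class. */
    static class Node {
        final String partialName;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String partialName, Node parent) {
            this.partialName = partialName;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        /** Joins the partial names from the root down with '.'. */
        String fullyQualifiedName() {
            return parent == null
                    ? partialName
                    : parent.fullyQualifiedName() + "." + partialName;
        }
    }

    /** Root-only listing: looks at the top level and never descends. */
    static List<String> rootNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            names.add(root.fullyQualifiedName());
        }
        return names;
    }

    /** Depth-first walk that also visits every descendant. */
    static List<String> allNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, names);
        }
        return names;
    }

    private static void collect(Node node, List<String> names) {
        names.add(node.fullyQualifiedName());
        for (Node child : node.children) {
            collect(child, names);
        }
    }

    /** Builds a small made-up tree with a single root field. */
    static List<Node> sampleRoots() {
        Node root = new Node("topmostSubform[0]", null);
        Node page = new Node("Page1[0]", root);   // assumed name
        new Node("f1_01[0]", page);               // assumed name
        new Node("f1_02[0]", page);               // assumed name
        List<Node> roots = new ArrayList<>();
        roots.add(root);
        return roots;
    }

    public static void main(String[] args) {
        List<Node> roots = sampleRoots();
        System.out.println("Root fields only: " + rootNames(roots));
        System.out.println("Whole tree:");
        for (String name : allNames(roots)) {
            System.out.println("  " + name);
        }
    }
}
```

In this model the root-only listing sees just the one top-level node, however many fields hang below it, which matches the single name reported in the question.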

If you want to access all fields, you have to walk the form field tree, e.g. like this:

public void test() throws IOException
{
    // Open the stream and the document together so both are closed when done.
    try (   InputStream resource = getClass().getResourceAsStream("f2290.pdf");
            PDDocument pdfDocument = PDDocument.load(resource))
    {
        PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        List<PDField> fields = acroForm.getFields();
        for (PDField field : fields)
        {
            list(field);
        }
    }
}

void list(PDField field)
{
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
    if (field instanceof PDNonTerminalField)
    {
        PDNonTerminalField nonTerminalField = (PDNonTerminalField) field;
        for (PDField child : nonTerminalField.getChildren())
        {
            list(child);
        }
    }
}

This prints a long list of fields for your document, including the non-terminal fields that only group other fields.
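If only the fields at the ends of the branches are of interest, the same walk can skip the grouping nodes. The sketch below is one possible variation, not part of the original answer: it uses an explicit stack instead of recursion and keeps only fields without children. The Field class is again a hypothetical stand-in rather than a PDFBox type, "no children" is a simplification of what PDFBox treats as a terminal field, and all names except topmostSubform[0] are assumed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class TerminalFieldSketch {

    /** Hypothetical stand-in for a form field; not a PDFBox class. */
    static class Field {
        final String fullName;
        final List<Field> children = new ArrayList<>();

        Field(String fullName) {
            this.fullName = fullName;
        }

        /** Adds a child whose full name extends this field's name. */
        Field child(String partialName) {
            Field child = new Field(fullName + "." + partialName);
            children.add(child);
            return child;
        }
    }

    /** Collects the names of childless fields, in tree order. */
    static List<String> terminalNames(List<Field> roots) {
        List<String> names = new ArrayList<>();
        Deque<Field> pending = new ArrayDeque<>();
        // Push in reverse so the first root is popped first.
        for (int i = roots.size() - 1; i >= 0; i--) {
            pending.push(roots.get(i));
        }
        while (!pending.isEmpty()) {
            Field field = pending.pop();
            if (field.children.isEmpty()) {
                names.add(field.fullName);
            } else {
                // Grouping node: skip its own name, queue its children.
                for (int i = field.children.size() - 1; i >= 0; i--) {
                    pending.push(field.children.get(i));
                }
            }
        }
        return names;
    }

    /** Builds a small made-up tree; child names are assumptions. */
    static List<Field> sampleRoots() {
        Field root = new Field("topmostSubform[0]");
        Field page = root.child("Page1[0]");
        page.child("f1_01[0]");
        page.child("f1_02[0]");
        return Arrays.asList(root);
    }

    public static void main(String[] args) {
        for (String name : terminalNames(sampleRoots())) {
            System.out.println(name);
        }
    }
}
```

With real PDFBox objects the equivalent filter would presumably be a check that the field is not a PDNonTerminalField, applied inside the list method shown above.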

PS: You have not stated which PDFBox version you use. As the PDFBox developers have begun recommending the current 2.0.0 release candidates, I assumed in my answer that you use that version.

mkl answered Jan 14 '23 14:01
