
Batch job submission error "Failed to process all documents", uris seem correct?

I've been trying to get Document AI batch submission working and am having some difficulty. Single-file submission using RawDocument works, and I could simply iterate over my data set (27k images), but I chose batch processing because it seems like the more appropriate technique.

When I run my code I am seeing an error: "Failed to process all documents". The first few lines of the debug information are:

    O:17:"Google\Rpc\Status":5:{ s:7:"*code";i:3;s:10:"*message";s:32:"Failed to process all documents."; s:26:"Google\Rpc\Statusdetails"; O:38:"Google\Protobuf\Internal\RepeatedField":4:{ s:49:"Google\Protobuf\Internal\RepeatedFieldcontainer";a:0:{}s:44:"Google\Protobuf\Internal\RepeatedFieldtype";i:11;s:45:"Google\Protobuf\Internal\RepeatedFieldklass";s:19:"Google\Protobuf\Any";s:52:"Google\Protobuf\Internal\RepeatedFieldlegacy_klass";s:19:"Google\Protobuf\Any";}s:38:"Google\Protobuf\Internal\Messagedesc";O:35:"Google\Protobuf\Internal\Descriptor":13:{s:46:"Google\Protobuf\Internal\Descriptorfull_name";s:17:"google.rpc.Status";s:42:"Google\Protobuf\Internal\Descriptorfield";a:3:{i:1;O:40:"Google\Protobuf\Internal\FieldDescriptor":14:{s:46:"Google\Protobuf\Internal\FieldDescriptorname";s:4:"code";

The support documentation for this error gives the cause as:

The gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to begin with gs:// and end with a trailing backslash character (/). Check the configuration for the Bucket URIs.

I am not using gcsUriPrefix (should I? My bucket holds more files than the maximum batch limit), but my gcsOutputConfig.gcsUri meets these requirements. The file list I've provided gives file names (pointed at the right bucket), so those should not have a trailing slash.

Advice welcome

    function filesFromBucket( $directoryPrefix ) {
        // NOT recursive, does not search the structure
        $gcsDocumentList = [];
    
        // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
        $bucketName = 'my-input-bucket';
        $storage = new StorageClient();
        $bucket = $storage->bucket($bucketName);
        $options = ['prefix' => $directoryPrefix];
        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            $doc->setGcsUri('gs://'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            array_push( $gcsDocumentList, $doc );
        }
    
        $gcsDocuments = new GcsDocuments();
        $gcsDocuments->setDocuments($gcsDocumentList);
        return $gcsDocuments;
    }
    
    function batchJob ( ) {
        $inputConfig = new BatchDocumentsInputConfig( ['gcs_documents'=>filesFromBucket('the-bucket-path/')] );
    
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
        // nb: all uri paths must end with / or an error will be generated.
        $outputConfig = new DocumentOutputConfig( 
            [ 'gcs_output_config' =>
                   new GcsOutputConfig( ['gcs_uri'=>'gs://my-output-bucket/'] ) ]
        );
     
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
        $documentProcessorServiceClient = new DocumentProcessorServiceClient();
        try {
            // derived from the prediction endpoint
            $name = 'projects/######/locations/us/processors/#######';
            $operationResponse = $documentProcessorServiceClient->batchProcessDocuments($name, ['inputDocuments'=>$inputConfig, 'documentOutputConfig'=>$outputConfig]);
            $operationResponse->pollUntilComplete();
            if ($operationResponse->operationSucceeded()) {
                $result = $operationResponse->getResult();
                printf('<br>result: %s<br>',serialize($result));
                // doSomethingWith($result)
            } else {
                $error = $operationResponse->getError();
                printf('<br>error: %s<br>', serialize($error));
                // handleError($error)
            }
        } finally {
            $documentProcessorServiceClient->close();
        }    
    }
Stephen asked Jan 22 '26 10:01


1 Answer

This turns out to be an ID-10-T error, with definite PEBKAC overtones.

$object->name() does not return the bucket name as part of the path.

Changing $doc->setGcsUri('gs://'.$object->name()); to $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name()); resolves the issue.
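For anyone who wants to guard against the same slip, here is a minimal sketch of the fix as a pair of small helpers. The names buildGcsUri and isInBucket are hypothetical (they are not part of the Google client libraries), and the object name in the usage lines is made up for illustration. The first helper joins the bucket and object names into a full URI; the second checks that a URI actually points into the expected bucket before it is handed to setGcsUri.

```php
<?php
// Hypothetical helper, not part of any Google library: joins a bucket name
// and an object name (as returned by $object->name()) into a gs:// URI.
function buildGcsUri(string $bucketName, string $objectName): string
{
    // Strip stray slashes so the two parts are joined by exactly one "/".
    return 'gs://' . trim($bucketName, '/') . '/' . ltrim($objectName, '/');
}

// Hypothetical sanity check: true only when the URI starts with
// "gs://<bucketName>/", i.e. the bucket segment is really present.
function isInBucket(string $uri, string $bucketName): bool
{
    $prefix = 'gs://' . $bucketName . '/';
    return strncmp($uri, $prefix, strlen($prefix)) === 0;
}

// Usage with an illustrative object name.
$bucketName = 'my-input-bucket';
$objectName = 'the-bucket-path/scan-001.png';

$brokenUri = 'gs://' . $objectName;                    // the original bug
$fixedUri  = buildGcsUri($bucketName, $objectName);    // bucket name included

$brokenOk = isInBucket($brokenUri, $bucketName);       // false
$fixedOk  = isInBucket($fixedUri, $bucketName);        // true
```

Inside the loop in filesFromBucket, the call would then read $doc->setGcsUri(buildGcsUri($bucketName, $object->name())); and a failed isInBucket check could be logged before the batch is submitted.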

Stephen answered Jan 24 '26 23:01


