Is there a more efficient way to list files from a bucket in Amazon S3 and also extract the metadata for each of those files? I'm using the AWS PHP SDK.
if ($paths = $s3->get_object_list('my-bucket')) {
    foreach ($paths as $path) {
        $meta = $s3->get_object_metadata('my-bucket', $path);
        echo $path . ' was modified on ' . $meta['LastModified'] . '<br />';
    }
}
At the moment I need to run get_object_list() to list all the files and then get_object_metadata() for each file to get its metadata.
If I have 100 files in my bucket, that's 101 calls to get this data. It would be good if it were possible to do it in one call.
E.g.:
if ($paths = $s3->get_object_list('my-bucket')) {
    foreach ($paths as $path) {
        echo $path['FileName'] . ' was modified on ' . $path['LastModified'] . '<br />';
    }
}
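Note: for standard fields such as LastModified, ETag and Size, the ListObjects response itself already carries them per key, so something close to the above is possible with the SDK's list_objects call instead of get_object_list(). A rough sketch, reusing the same $s3 client and bucket name as above (this only covers one page of up to 1,000 keys; see the pagination answer below):
// Sketch: read standard metadata straight from the list_objects response,
// avoiding one HEAD request per object.
$res = $s3->list_objects('my-bucket');
if ($res->isOK()) {
    foreach ($res->body->Contents as $object) {
        echo (string) $object->Key
            . ' was modified on ' . (string) $object->LastModified
            . ' (' . (string) $object->Size . ' bytes)<br />';
    }
}
Custom x-amz-meta-* headers are not included in that response, which is what the answers below address.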
I know this is a bit old, but I ran into this problem and solved it by extending the AWS SDK to use its batch functionality for this kind of request. It makes retrieving custom metadata for lots of files much quicker. This is my code:
/**
 * Name: Steves_Amazon_S3
 *
 * Extends the AmazonS3 class in order to create a function to
 * more efficiently retrieve a list of files and their custom
 * metadata using the CFBatchRequest functionality.
 */
class Steves_Amazon_S3 extends AmazonS3 {

    public function get_object_metadata_batch($bucket, $filenames, $opt = null) {
        $batch = new CFBatchRequest();

        // Queue one HEAD request per object so they are sent in parallel.
        foreach ($filenames as $filename) {
            $this->batch($batch)->get_object_headers($bucket, $filename);
        }

        $response = $this->batch($batch)->send();

        // Fail if any requests were unsuccessful
        if (!$response->areOK()) {
            return false;
        }

        $result = array();
        foreach ($response as $file) {
            $temp = array();

            $temp['name'] = (string) basename($file->header['_info']['url']);
            $temp['etag'] = (string) basename($file->header['etag']);
            $temp['size'] = $this->util->size_readable((integer) $file->header['content-length']);
            $temp['size_raw'] = $file->header['content-length'];
            $temp['last_modified'] = (string) date("jS M Y H:i:s", strtotime($file->header['last-modified']));
            $temp['last_modified_raw'] = strtotime($file->header['last-modified']);

            // Custom x-amz-meta-* headers; @ suppresses notices when a header is absent.
            @$temp['creator_id'] = (string) $file->header['x-amz-meta-creator'];
            @$temp['client_view'] = (string) $file->header['x-amz-meta-client-view'];
            @$temp['user_view'] = (string) $file->header['x-amz-meta-user-view'];

            $result[] = $temp;
        }

        return $result;
    }
}
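For context, here is roughly how the extended class might be used. This assumes your credentials are configured the same way you already construct AmazonS3, and 'my-bucket' is just a placeholder:
// Hypothetical usage of the batch helper above.
$s3 = new Steves_Amazon_S3();
$filenames = $s3->get_object_list('my-bucket');   // all keys in the bucket
$files = $s3->get_object_metadata_batch('my-bucket', $filenames);
if ($files !== false) {
    foreach ($files as $file) {
        echo $file['name'] . ' was modified on ' . $file['last_modified'] . '<br />';
    }
}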
You need to know that the list_objects function has a limit: it will not return more than 1,000 objects per call, even if the max-keys option is set to a larger number.
To work around this you need to load the data in several passes:
private function _getBucketObjects($prefix = '', $booOneLevelOny = false)
{
    $objects = array();
    $lastKey = null;

    do {
        $args = array();

        // Continue listing from the last key returned by the previous page.
        if (isset($lastKey)) {
            $args['marker'] = $lastKey;
        }

        if (strlen($prefix)) {
            $args['prefix'] = $prefix;
        }

        if ($booOneLevelOny) {
            $args['delimiter'] = '/';
        }

        $res = $this->_client->list_objects($this->_bucket, $args);
        if (!$res->isOK()) {
            return null;
        }

        foreach ($res->body->Contents as $object) {
            $objects[] = $object;
            $lastKey = (string) $object->Key;
        }

        // S3 sets IsTruncated to 'true' while there are more pages to fetch.
        $isTruncated = (string) $res->body->IsTruncated;
        unset($res);
    } while ($isTruncated == 'true');

    return $objects;
}
As a result, you'll have a full list of the objects.
What if you have some custom headers? They are not returned by the list_objects function. In that case, this will help:
$arrHeaders = array();
foreach (array_chunk($arrObjects, 1000) as $object_set) {
    $batch = new CFBatchRequest();

    // Queue a HEAD request for every real object (skip folder placeholders).
    foreach ($object_set as $object) {
        if (!$this->isFolder((string) $object->Key)) {
            $this->_client->batch($batch)->get_object_headers($this->_bucket, $this->preparePath((string) $object->Key));
        }
    }

    $response = $this->_client->batch($batch)->send();

    if ($response->areOK()) {
        foreach ($response as $arrHeaderInfo) {
            $arrHeaders[] = $arrHeaderInfo->header;
        }
    }

    unset($batch, $response);
}
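Here $arrObjects is the array returned by _getBucketObjects() above. The isFolder() and preparePath() calls are the answer author's own helpers and aren't shown; a plausible minimal version, assuming "folders" are the zero-byte placeholder keys ending in a slash and that no key rewriting is actually needed, might look like this:
// Hypothetical helpers to make the snippet above self-contained.
private function isFolder($key)
{
    // S3 "folders" are usually zero-byte placeholder objects whose key ends in '/'.
    return substr($key, -1) === '/';
}

private function preparePath($key)
{
    // No transformation in the simplest case; adjust if your keys need cleanup.
    return $key;
}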