Efficient way to extract files and meta data from Amazon S3?

Tags:

amazon-s3

Is there a more efficient way to list files from a bucket in Amazon S3 and also extract the meta data for each of those files? I'm using the AWS PHP SDK.

if ($paths = $s3->get_object_list('my-bucket')) {
    foreach($paths AS $path) {
        $meta = $s3->get_object_metadata('my-bucket', $path);
        echo $path . ' was modified on ' . $meta['LastModified'] . '<br />';
    }
}

At the moment I need to run get_object_list() to list all the files and then get_object_metadata() for each file to get its meta data.

If I have 100 files in my bucket, it makes 101 calls to get this data. It would be good if it's possible to do it in 1 call.

E.g:

if ($paths = $s3->get_object_list('my-bucket')) {
    foreach($paths AS $path) {
        echo $path['FileName'] . ' was modified on ' . $path['LastModified'] . '<br />';
    }
}
asked Jun 12 '12 by Ben Sinclair


People also ask

How do I extract data from aws S3?

In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it. The procedure for saving the object depends on the browser and operating system that you are using.
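For a programmatic equivalent with the SDK 1.x client used in this question, a single get_object call returns the file contents. A minimal sketch, where the bucket name, key and local path are placeholders:

// Sketch: fetch one object's contents with the SDK 1.x client from the question.
$response = $s3->get_object('my-bucket', 'path/to/file.txt');
if ($response->isOK()) {
    // For a plain file the response body is the raw object data
    file_put_contents('/tmp/file.txt', (string) $response->body);
}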

How can you download an S3 bucket including all folders and files?

To download an entire bucket to your local file system, use the AWS CLI sync command, passing it the s3 bucket as a source and a directory on your file system as a destination, e.g. aws s3 sync s3://YOUR_BUCKET . . The sync command recursively copies the contents of the source to the destination.

Which S3 storage class takes most time to retrieve data?

Data stored in the S3 Glacier Deep Archive storage class has a minimum storage duration period of 180 days and a default retrieval time of 12 hours. If you have deleted, overwritten, or transitioned to a different storage class an object before the 180-day minimum, you are charged for 180 days.

How can I improve my S3 performance?

You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.


2 Answers

I know this is a bit old, but I encountered this problem, and to solve it I extended the AWS SDK to use the batch functionality for this type of problem. It makes it a lot quicker to retrieve custom metadata for lots of files. This is my code:

    /**
     * Name: Steves_Amazon_S3
     * 
     * Extends the AmazonS3 class in order to create a function to 
     * more efficiently retrieve a list of
     * files and their custom metadata using the CFBatchRequest function.
     * 
     * 
     */
    class Steves_Amazon_S3 extends AmazonS3 {

        public function get_object_metadata_batch($bucket, $filenames, $opt = null) {
            $batch = new CFBatchRequest();

            foreach ($filenames as $filename) {

                $this->batch($batch)->get_object_headers($bucket, $filename); // Queue a parallel HEAD request for this object's headers (includes custom x-amz-meta-* metadata)
            }

            $response = $this->batch($batch)->send();

            // Fail if any requests were unsuccessful
            if (!$response->areOK()) {
                return false;
            }
            $result = array(); // Initialise so the method always returns an array
            foreach ($response as $file) {
                $temp = array();
                $temp['name'] = (string) basename($file->header['_info']['url']);
                $temp['etag'] = (string) basename($file->header['etag']);
                $temp['size'] = $this->util->size_readable((integer) basename($file->header['content-length']));
                $temp['size_raw'] = basename($file->header['content-length']);
                $temp['last_modified'] = (string) date("jS M Y H:i:s", strtotime($file->header['last-modified']));
                $temp['last_modified_raw'] = strtotime($file->header['last-modified']);
                @$temp['creator_id'] = (string) $file->header['x-amz-meta-creator'];
                @$temp['client_view'] = (string) $file->header['x-amz-meta-client-view'];
                @$temp['user_view'] = (string) $file->header['x-amz-meta-user-view'];

                $result[] = $temp;
            }

            return $result;
        }
    }
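
Usage is then a two-step affair: list the keys once, then fetch all their headers in one parallel batch. A minimal sketch, assuming the extended class above and the get_object_list() call from the question (credentials are picked up the same way as for the plain AmazonS3 client):

$s3 = new Steves_Amazon_S3();
if ($paths = $s3->get_object_list('my-bucket')) {
    // One batch of parallel HEAD requests instead of one request per file
    $files = $s3->get_object_metadata_batch('my-bucket', $paths);
    if ($files) {
        foreach ($files as $file) {
            echo $file['name'] . ' was modified on ' . $file['last_modified'] . '<br />';
        }
    }
}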
answered Oct 06 '22 by Stevo


You need to know that the list_objects function has a limit: it won't return more than 1,000 objects per call, even if the max-keys option is set to a larger number.

To work around this, you need to load the data in several requests:

private function _getBucketObjects($prefix = '', $booOneLevelOnly = false)
{
    $objects = array();
    $lastKey = null;
    do {
        $args = array();
        if (isset($lastKey)) {
            $args['marker'] = $lastKey;
        }

        if (strlen($prefix)) {
            $args['prefix'] = $prefix;
        }

        if ($booOneLevelOnly) {
            $args['delimiter'] = '/';
        }

        $res = $this->_client->list_objects($this->_bucket, $args);
        if (!$res->isOK()) {
            return null;
        }

        foreach ($res->body->Contents as $object) {
            $objects[] = $object;
            $lastKey = (string)$object->Key;
        }
        $isTruncated = (string)$res->body->IsTruncated;
        unset($res);
    } while ($isTruncated == 'true');

    return $objects;
}

As a result, you'll have the full list of objects.
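
Each entry in that list is a Contents element from the listing response, which already carries the standard metadata (Key, LastModified, Size, ETag), so for the original question's case of printing names and modification dates no per-object call is needed at all. A minimal sketch, assuming the _getBucketObjects() helper above is called from inside the same class:

// Sketch: print name and modification date straight from the listing entries.
// Each $object is a <Contents> element; LastModified, Size and ETag come with it.
$objects = $this->_getBucketObjects();
if ($objects !== null) {
    foreach ($objects as $object) {
        echo (string) $object->Key . ' was modified on ' . (string) $object->LastModified . '<br />';
    }
}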


What if you have custom headers? They will not be returned by the list_objects function. In that case, this will help:

$arrHeaders = array();
// Work in chunks of 1,000 keys so each CFBatchRequest stays a reasonable size
foreach (array_chunk($arrObjects, 1000) as $object_set) {
    $batch = new CFBatchRequest();
    foreach ($object_set as $object) {
        if(!$this->isFolder((string)$object->Key)) {
            $this->_client->batch($batch)->get_object_headers($this->_bucket, $this->preparePath((string)$object->Key));
        }
    }

    $response = $this->_client->batch($batch)->send();

    if ($response->areOK()) {
        foreach ($response as $arrHeaderInfo) {
            $arrHeaders[] = $arrHeaderInfo->header;
        }
    }
    unset($batch, $response);
}
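
To pair the collected headers back to object names, the _info['url'] entry that the SDK keeps on each header set can be used, just as in the first answer. A minimal sketch, assuming $arrHeaders was filled by the loop above (x-amz-meta-creator is only an example custom header, borrowed from the first answer):

// Sketch: read the object name and a custom header from each collected header set.
foreach ($arrHeaders as $header) {
    $name = basename($header['_info']['url']); // last path segment of the request URL
    $creator = isset($header['x-amz-meta-creator']) ? $header['x-amz-meta-creator'] : null;
    echo $name . ' was created by ' . $creator . '<br />';
}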
answered Oct 06 '22 by Andron