Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Downloading a large archive from AWS Glacier using Boto

I am trying to download a large archive (~ 1 TB) from Glacier using the Python package, Boto. The current method that I am using looks like this:

import os
import boto.glacier
import boto
import time

ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'

layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY)

gv = layer2.get_vault(VAULT_NAME)

job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)

The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the entire archive. I will need to break the download into at least 4 pieces. How can I do this and write the output to a single file?

like image 772
sahoang Avatar asked Jan 16 '15 13:01

sahoang


1 Answers

It seems that you can simply specify the chunk_size parameter when calling job.download_to_file like so :

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download the all the chunks during the 24 hours I don't think you can choose to download only the one you missed using layer2.

First method

Using layer1 you can simply use the method get_job_output and specify the byte-range you want to download.

It would look like that :

file_size = check_file_size(OUTPUT)

if job.completed:
    print "Downloading archive"
    with open(OUTPUT, 'wb') as output_file:
        i = 0
        while True:
            response = gv.get_job_output(VAULT_NAME, job_id, (file_size + 1024 * 1024 * i, file_size + 1024 * 1024 * (i + 1)))
            output_file.write(response)
            if len(response) < 1024 * 1024:
                break
            i += 1

With this script you should be able to rerun the script when it fails and continue to download your archive where you left it.

Second method

By digging in the boto code I found a "private" method in the Job class that you might also use : _download_byte_range. With this method you can still use layer2.

file_size = check_file_size(OUTPUT)

if job.completed:
    print "Downloading archive"
    with open(OUTPUT, 'wb') as output_file:
        i = 0
        while True:
            response = job._download_byte_range(file_size + 1024 * 1024 * i, file_size + 1024 * 1024 * (i + 1)))
            output_file.write(response)
            if len(response) < 1024 * 1024:
                break
            i += 1
like image 85
volent Avatar answered Oct 26 '22 15:10

volent