 

How to list files inside tar in AWS S3 without downloading it?

While looking around for ideas I found https://stackoverflow.com/a/54222447/264822 for zip files which I think is a very clever solution. But it relies on zip files having a Central Directory - tar files don't.

I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj parameter:

import boto3
import io
import tarfile

class S3File(io.BytesIO):
    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        return

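    def read(self, size=-1):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        if size == 0:
            return b''
        start = self.offset
        # A negative size means "read to the end", i.e. an open-ended range.
        end = '' if size is None or size < 0 else start + size - 1
        try:
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes={}-{}".format(start, end))
        except self.s3client.exceptions.ClientError as err:
            # S3 rejects a range that starts past the end of the object; treat that as EOF.
            if err.response['Error']['Code'] != 'InvalidRange':
                raise
            return b''
        result = s3_object['Body'].read()
        # Advance by what was actually returned, which may be less than size near EOF.
        self.offset += len(result)
        return result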

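    def seek(self, offset, whence=0):
        if whence == io.SEEK_SET:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset
        elif whence == io.SEEK_CUR:
            self.offset += offset
        else:
            # SEEK_END would need the object size, which this class does not fetch.
            raise io.UnsupportedOperation('unsupported whence: {}'.format(whence))
        # File objects are expected to return the new absolute position.
        return self.offset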

    def tell(self):
        return self.offset

# bucket_name and file_name are assumed to be defined elsewhere.
s3client = boto3.client('s3')
s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)

This works fine except the output looks like:

read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)

tarfile is just reading through the whole file sequentially, so I haven't gained anything. Is there any way of making tarfile read only the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing so that it:

  1. Reads the 512-byte header and writes this into a BytesIO buffer.
  2. Gets the size of the file following and writes zeroes into the BytesIO buffer.
  3. Skips over the file to the next header.

But this seems overly complicated.
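The three steps above can be sketched without going through tarfile at all. This is a minimal sketch, not a full parser: it assumes an uncompressed archive with plain ustar-style headers (no GNU long names, pax extended headers, or base-256 sizes), and `read_range(start, length)` is a hypothetical callable standing in for a ranged S3 `get_object` call. The demo builds a small archive in memory so that it runs without S3.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data padding both use 512-byte blocks


def list_tar_names(read_range):
    """Walk tar headers using read_range(start, length) -> bytes.

    read_range is a hypothetical helper; against S3 it could wrap
    get_object(..., Range="bytes=start-end").
    """
    names = []
    offset = 0
    while True:
        header = read_range(offset, BLOCK)
        # A short or all-zero block marks the end of the archive.
        if len(header) < BLOCK or header == bytes(BLOCK):
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        # The prefix field is only meaningful for POSIX ustar headers.
        if header[257:263] == b"ustar\0":
            prefix = header[345:500].split(b"\0", 1)[0].decode("utf-8")
            if prefix:
                name = prefix + "/" + name
        # The size field is an octal string padded with NULs or spaces.
        size = int(header[124:136].strip(b" \0") or b"0", 8)
        names.append(name)
        # Skip the header plus the member data, rounded up to whole blocks.
        offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK
    return names


# Demo: an in-memory archive stands in for the S3 object.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for member_name, payload in [("a.txt", b"x" * 1000), ("dir/b.txt", b"y" * 10)]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()


def read_range(start, length):
    return archive[start:start + length]


names = list_tar_names(read_range)
print(names)
```

Each member costs one 512-byte request regardless of its size, so the trade-off is many small requests against one large download.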

parsley72 asked May 11 '19 at 01:05



1 Answer

My mistake. I'm actually dealing with tar.gz files, and I assumed that zip and tar.gz are similar. They're not - a tar.gz is a tar archive that is then compressed with gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work on the compressed file.

What does work is:

s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
    print(name)
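The snippet above holds the whole compressed archive in memory. A possible variation, sketched here under assumptions, is tarfile's stream mode (`'r|gz'`), which reads sequentially and only needs a `read()` method on the file object. The whole object is still transferred, it just isn't buffered in one piece. `list_tar_gz_names` and `ReadOnly` are hypothetical names, and the demo uses an in-memory archive rather than S3.

```python
import io
import tarfile


def list_tar_gz_names(stream):
    """List member names from a gzip-compressed tar read sequentially.

    stream is any object with a read(size) method.
    """
    # 'r|gz' is tarfile's stream mode: it never seeks on the underlying
    # file object, so read() is the only method it needs.
    with tarfile.open(fileobj=stream, mode="r|gz") as tarf:
        return tarf.getnames()


class ReadOnly:
    """Stand-in for a network stream: exposes read() and nothing else."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


# Demo: build a small tar.gz in memory and list it through the stream.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for member_name, payload in [("one.txt", b"1" * 5000), ("two.txt", b"2")]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

stream_names = list_tar_gz_names(ReadOnly(buf.getvalue()))
print(stream_names)
```

With boto3, passing `s3_object['Body']` as `stream` should work on the assumption that the response body's `read()` accepts a size argument; that has not been verified here.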

I suspect the original code will work for a tar file but I don't have any to try it on.
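That guess about plain tar files can be checked locally without S3. The sketch below uses a hypothetical `CountingFile` that records how many bytes tarfile actually reads, and opens an uncompressed archive with mode `'r:'` so that no compression detection takes place. If tarfile seeks past member data, as expected for a seekable uncompressed file object, the count should be a small fraction of the archive size.

```python
import io
import tarfile


class CountingFile(io.BytesIO):
    """In-memory file that records how many bytes are actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# Build an uncompressed tar with one large member and one small one.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for member_name, payload in [("big.bin", bytes(1024 * 1024)), ("small.txt", b"hi")]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

counting = CountingFile(archive)
# 'r:' means "uncompressed tar only", so tarfile does not probe for gzip etc.
with tarfile.open(fileobj=counting, mode="r:") as tarf:
    plain_names = tarf.getnames()

print(plain_names)
print("archive size:", len(archive), "bytes read:", counting.bytes_read)
```

The same idea applies to the S3File class from the question: for an uncompressed tar, only the headers would need to be fetched, at the cost of one ranged request per read.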

parsley72 answered Oct 12 '22 at 13:10
