While looking around for ideas I found https://stackoverflow.com/a/54222447/264822 for zip files which I think is a very clever solution. But it relies on zip files having a Central Directory - tar files don't.
I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj
parameter:
import boto3
import io
import tarfile
class S3File(io.BytesIO):
def __init__(self, bucket_name, key_name, s3client):
super().__init__()
self.bucket_name = bucket_name
self.key_name = key_name
self.s3client = s3client
self.offset = 0
def close(self):
return
def read(self, size):
print('read: offset = {}, size = {}'.format(self.offset, size))
start = self.offset
end = self.offset + size - 1
try:
s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes=%d-%d" % (start, end))
except:
return bytearray()
self.offset = self.offset + size
result = s3_object['Body'].read()
return result
def seek(self, offset, whence=0):
if whence == 0:
print('seek: offset {} -> {}'.format(self.offset, offset))
self.offset = offset
def tell(self):
return self.offset
s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
print(name)
This works fine except the output looks like:
read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)
tarfile is just reading the whole file anyway so I haven't gained anything. Is there anyway of making tarfile only read the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing so it:
BytesIO
buffer.BytesIO
buffer.But this seems overly complicated.
My mistake. I'm actually dealing with tar.gz files but I assumed that zip and tar.gz are similar. They're not - tar is an archive file which is then compressed as gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.
What does work is:
s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
print(name)
I suspect the original code will work for a tar file but I don't have any to try it on.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With