
Boto "get byte range" returns more than expected

This is my first question here as I'm fairly new to this world! I've spent a few days trying to figure this out for myself, but haven't so far been able to find any useful info.

I'm trying to retrieve a byte range from a file stored in S3, using something like:

S3Key.get_contents_to_file(tempfile, headers={'Range': 'bytes=0-100000'})

The file that I'm trying to restore from is a video file, specifically an MXF. When I request a byte range, I get back more info in the tempfile than requested. For example, using one file, I request 100,000 bytes and get back 100,451.

One thing to note about MXF files is that they legitimately contain 0x0A (ASCII line feed) and 0x0D (ASCII carriage return).

I had a dig around and it appears that any time a 0A byte is present in the file, the retrieved data contains 0D 0A instead of just 0A, so more data appears to be retrieved than was requested.

As an example, the original file contains the hex string:

02 03 00 00 00 00 3B 0A 06 0E 2B 34 01 01 01 05

But the file downloaded from S3 has:

02 03 00 00 00 00 3B 0D 0A 06 0E 2B 34 01 01 01 05

I've tried to debug the code and work my way through the Boto logic, but I'm relatively new at this, so I get lost very easily.

I created this for testing, which shows the issue:

from boto.s3.connection import S3Connection
from boto.s3.connection import Location
from boto.s3.key import Key
import boto
import os


## AWS credentials
AWS_ACCESS_KEY_ID = 'access key'
AWS_SECRET_ACCESS_KEY = 'secret key'

## Bucket name and path to file
bucketName = 'bucket name'
filePath = 'path/to/file.mxf'

#Local temp file to download to
tempFilePath = 'c:/tmp/tempfile'


## Setup the S3 connection and create a Key to access the file specified
## in filePath
conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucketName)
S3Key = Key(bucket)
S3Key.key = filePath

def testRangeGet(bytesToRead=100000): # default read of 100K
    tempfile = open(tempFilePath, 'w')
    rangeString = 'bytes=0-' + str(bytesToRead -1)  #create byte range as string
    rangeDict = {'Range': rangeString} # add this to the dictionary
    S3Key.get_contents_to_file(tempfile, headers=rangeDict) # using Boto
    tempfile.close()
    bytesRead = os.path.getsize(tempFilePath)
    print 'Bytes requested = ' + str(bytesToRead)
    print 'Bytes received = ' + str(bytesRead)
    print 'Additional bytes = ' + str(bytesRead - bytesToRead)

I guess there is something in the Boto code that is looking out for certain ASCII escape characters and modifying them, and I can't find any way to specify to just treat it as a binary file.

Has anyone had a similar problem and can share a way around it?

Thanks

Tim

Tim Davis asked Nov 02 '22 21:11



1 Answer

Open your output file in binary mode. Otherwise, writing to that file will automatically convert every LF (0x0A) to CR LF (0x0D 0x0A).

tempfile = open(tempFilePath, 'wb')

That is only necessary on Windows systems, of course. Unix systems won't convert anything, regardless of whether the file has been opened in text or binary mode.
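
The effect can be reproduced without S3 or Boto. The following is a minimal sketch, written for Python 3 rather than the Python 2 used in the question, that writes the 16 bytes from the question's hex example once in binary mode and once in text mode. The newline="\r\n" argument is an assumption used to imitate Windows text-mode behaviour on any platform, and write_and_read_back is a hypothetical helper name.

```python
import os
import tempfile

# The 16 bytes from the question's hex dump; note the single 0x0A (LF).
ORIGINAL = bytes.fromhex("0203000000003B0A060E2B3401010105")


def write_and_read_back(data, binary):
    """Write data to a temp file in binary or text mode; return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        if binary:
            with open(path, "wb") as f:
                f.write(data)
        else:
            # newline="\r\n" imitates Windows text mode on any platform
            # (an assumption for illustration); latin-1 maps bytes 0-255 one to one.
            with open(path, "w", encoding="latin-1", newline="\r\n") as f:
                f.write(data.decode("latin-1"))
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


binary_copy = write_and_read_back(ORIGINAL, binary=True)
text_copy = write_and_read_back(ORIGINAL, binary=False)

# Sizes of the original, the binary-mode copy, and the text-mode copy.
print(len(ORIGINAL), len(binary_copy), len(text_copy))
# Hex dump of the text-mode copy, to compare with the dump in the question.
print(" ".join("%02X" % b for b in text_copy))
```

The binary-mode copy is byte-for-byte identical to the input, while the text-mode copy gains one extra 0x0D in front of the 0x0A, which matches the dump shown in the question.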

You should also take care when uploading that data corrupted in this way doesn't get into S3 in the first place.
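
One way to check whether a copy has already been damaged like this is to test whether it equals the original with every LF expanded to CR LF. The sketch below is illustrative only: looks_crlf_expanded is a hypothetical helper, not part of Boto, and it assumes both byte strings fit in memory.

```python
def looks_crlf_expanded(original, suspect):
    """Return True if suspect is original with each LF (0x0A) replaced by CR LF."""
    if original == suspect:
        return False  # identical copies, nothing was translated
    return suspect == original.replace(b"\n", b"\r\n")


# The two hex strings from the question.
original = bytes.fromhex("0203000000003B0A060E2B3401010105")
damaged = bytes.fromhex("0203000000003B0D0A060E2B3401010105")

print(looks_crlf_expanded(original, damaged))
print(looks_crlf_expanded(original, original))

# The number of extra bytes equals the number of LF bytes in the original.
print(len(damaged) - len(original) == original.count(b"\n"))
```

To use it on real files, read both with open(path, 'rb') so that the comparison itself is not affected by newline translation.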

Alfe answered Nov 11 '22 23:11
