Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.

I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).

There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.

What are the various tradeoffs to using BytesIO vs bytes?

def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):

Playing around I've found that using the File-Like interfaces (BytesIO, etc) requires a bit of administration overhead in terms of seek(0) etc. That raises a questions like:

  • is it better to seek before you start, or after you've finished?
  • do you seek to the start or just operate from the position the file is in?
  • should you tell() to maintain the position?
  • looking at something like shutil.copyfileobj it doesn't do any seeking

One advantage I've found with using file-like interfaces instead is that it allows for passing in the fp to write into when you're retrieving data. Which seems to give a good deal of flexibility.

def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()

    # get the contents and put it into write_into

    return write_into

get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc

Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?

like image 566
Aidan Kane Avatar asked Feb 18 '15 17:02

Aidan Kane


1 Answers

The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()) and for every call to read(n) the pointer advances n bytes. Ex.

import io

buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'

buf.tell()  # Returns 1
buf.read(1) # Returns b'e'

buf.tell() # Returns 2

# Set the pointer to 0.
buf.seek(0)
buf.read() # This will return b'H', like the first call.

In your use case, both the bytes object and the io.BytesIO object are maybe not the best solutions. They will read the complete contents of your files into memory.

Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).

like image 161
Cochise Ruhulessin Avatar answered Sep 20 '22 13:09

Cochise Ruhulessin