
Slicing a file in Python

I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory mapped file wrapped by a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

  1. Is bytearray copying the data?
  2. I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read only container that I should use?
  3. Because I do not accumulate in the buffer, I am randomly accessing the bytearray (and therefore the underlying file). Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
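For reference, the approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: `iter_chunk_bounds`, `chunk_file`, and the `is_boundary` predicate are hypothetical names, and the sketch indexes the mmap directly instead of wrapping it in a bytearray.

```python
import mmap


def iter_chunk_bounds(data, is_boundary):
    """Yield (offset, size) pairs; a chunk ends right after each boundary byte."""
    start = 0
    for pos in range(len(data)):
        # Indexing bytes, bytearray, or mmap with an int returns an int (Python 3)
        if is_boundary(data[pos]):
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(data):
        # Trailing chunk that does not end on a boundary
        yield start, len(data) - start


def chunk_file(path, is_boundary):
    """Map the file read-only and let the outer code slice out each chunk."""
    # Note: mapping an empty file raises ValueError
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing an mmap returns a bytes copy of just that range
        return [mm[off:off + size]
                for off, size in iter_chunk_bounds(mm, is_boundary)]
```

For example, `chunk_file(path, lambda b: b == 0x0A)` would split a file after every newline byte.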
Hernan asked Nov 01 '14 15:11



1 Answer

  1. Yes. Building a bytearray from another object copies the data. If you want the file contents in a bytearray, you can read them directly into a preallocated one and avoid an intermediate copy:

    import os

    with open(FILENAME, 'rb') as f:
        # Preallocate a buffer of the file's size, then fill it in place
        data = bytearray(os.path.getsize(FILENAME))
        f.readinto(data)
    

from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12

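  2. There is a string-to-bytearray conversion, so there is a potential performance issue.

  3. bytearray is an array, so its size is capped: it can hold at most PY_SSIZE_T_MAX bytes (the PY_SSIZE_T_MAX/sizeof(PyObject*) figure is the limit for lists, which store pointers), and a copy of the file also has to fit in memory. For more info, you can visit How Big can a Python Array Get?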
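As a possible alternative to copying into a bytearray, the buffer-protocol article linked above suggests memoryview. The sketch below is an assumption about how that could apply here, not something tested against the asker's script: `read_slices` is a hypothetical helper that takes the (offset, size) pairs and copies only the slices actually requested.

```python
import mmap


def read_slices(path, bounds):
    """Return bytes for each (offset, size) pair without copying the whole file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # memoryview exposes the mapping through the buffer protocol without
        # copying. For an ACCESS_READ mapping the view is read-only.
        # The view must be released before the mmap closes, hence the nested with.
        with memoryview(mm) as view:
            # view[a:b] is itself a zero-copy view; bytes(...) copies that slice only
            return [bytes(view[off:off + size]) for off, size in bounds]
```

Because the view is read-only, it also covers the "read only container" part of the question; whether it is faster than slicing the mmap directly would need measuring.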

snowblade answered Oct 19 '22 03:10