I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.
It was very convenient to use a memory-mapped file wrapped by a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.
It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:
I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read-only container that I should use?
My other concern is reading from the bytearray (and therefore the underlying file). Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
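The setup described in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the names chunk_bounds and chunks_of_file, the newline boundary rule, and the max_size cap are all hypothetical stand-ins for the real content-dependent rule. The sketch indexes the read-only mmap object directly instead of wrapping it in a bytearray, which avoids creating a second, mutable copy of the data.

```python
import mmap
import os
import tempfile


def chunk_bounds(buf, boundary=0x0A, max_size=64):
    """Yield (offset, size) pairs for a bytes-like object.

    Hypothetical rule: a chunk ends right after a `boundary` byte,
    or once it has grown to `max_size` bytes.
    """
    start = 0
    for pos in range(len(buf)):
        # Indexing a bytes-like object (bytes, bytearray, mmap) gives an int.
        if buf[pos] == boundary or pos - start + 1 >= max_size:
            yield start, pos - start + 1
            start = pos + 1
    if start < len(buf):
        # Trailing chunk that did not end on a boundary byte.
        yield start, len(buf) - start


def chunks_of_file(path):
    """Map the file read-only and let the caller-side code do the slicing."""
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing an mmap returns a bytes object holding only that chunk.
        return [mm[off:off + size] for off, size in chunk_bounds(mm)]


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma")
try:
    result = chunks_of_file(tmp.name)
finally:
    os.remove(tmp.name)
print(result)
```

Because the generator only yields offsets and sizes, the same function works unchanged on bytes, bytearray, or mmap objects, and only the chunks that are actually sliced get copied out of the mapping.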
Converting one object to a mutable object does incur data copying. You can read the file directly into a bytearray by using:
import os
with open(FILENAME, 'rb') as f:
    # Preallocate a buffer the size of the file, then fill it in place.
    data = bytearray(os.path.getsize(FILENAME))
    f.readinto(data)
from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12
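To make the copying point concrete, here is a small sketch, assuming a tiny temporary file as a stand-in for the real input. It contrasts bytearray(mm), which copies the whole mapping, with memoryview(mm), which exposes the same memory through the buffer protocol without copying. All variable names are illustrative.

```python
import mmap
import os
import tempfile

# Create a small throwaway file to map (stand-in for the real input file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")

try:
    with open(tmp.name, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        copy = bytearray(mm)  # copies the entire mapping into a new buffer

        # A memoryview shares the mapping's memory; slicing it does not copy.
        # Both views must be released before the mmap can be closed,
        # which the with statement takes care of.
        with memoryview(mm) as view, view[2:5] as part:
            view_is_readonly = view.readonly
            part_bytes = bytes(part)  # copy only the bytes actually needed

        copy[0] = ord("X")     # modifies the private copy only
        first_in_file = mm[0]  # the mapped file content is unchanged
finally:
    os.remove(tmp.name)

print(view_is_readonly, part_bytes, bytes(copy[:1]), first_in_file)
```

With access=mmap.ACCESS_READ the memoryview reports itself as read-only, so it can serve as a read-only container over the mapped file; the bytearray, by contrast, is an independent mutable copy.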
There is a string-to-bytearray conversion, so there is a potential performance issue.
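A bytearray is an array, so its length is bounded by PY_SSIZE_T_MAX. (The PY_SSIZE_T_MAX/sizeof(PyObject*) limit applies to lists, which store one pointer per element; a bytearray stores one byte per element.) For more info, see "How Big can a Python Array Get?"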
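These bounds can be inspected from Python itself. The sketch below relies on sys.maxsize being the largest value a Py_ssize_t can hold, and on struct.calcsize("P") giving the size of a C pointer on the running interpreter; the variable names are illustrative.

```python
import struct
import sys

# sys.maxsize is the largest value a Py_ssize_t can hold,
# which is the hard upper bound on len() for any container.
max_len = sys.maxsize

# Size of a C pointer (void *) on this interpreter build.
pointer_size = struct.calcsize("P")

# Containers of object pointers, such as list, store one pointer per element,
# so their length is bounded by PY_SSIZE_T_MAX / sizeof(PyObject*).
list_limit = max_len // pointer_size

print(f"Py_ssize_t maximum:        {max_len}")
print(f"pointer size in bytes:     {pointer_size}")
print(f"bound on pointer elements: {list_limit}")
```

On a 64-bit build these numbers are far larger than any realistic file, so available memory and address space are likely to be the practical limit well before the theoretical one.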