Say I have a very large bytes object (loaded from a binary file) and I want to read it part by part, advancing the starting position until I reach the end. I use slicing to accomplish this. I'm worried that Python will create a completely new copy each time I ask for a slice, instead of simply giving me a reference to the memory at the position I want.
Simple example:
from pathlib import Path

data = Path("binary-file.dat").read_bytes()
total_length = len(data)
start_pos = 0
while start_pos < total_length:
    bytes_processed = decode_bytes(data[start_pos:])  # <---- ***
    start_pos += bytes_processed
In the above example, does Python create a completely new copy of the bytes object starting from start_pos because of the slicing? If so, what is the best way to avoid the copy and instead pass something that just points to the relevant position of the bytes data?
Python does slice-by-copy, meaning that every time you slice (except for very trivial slices such as a[:]), it copies all of the data into a new object. A slice-by-reference approach is more complicated, harder to implement, and may lead to unexpected behavior.
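A quick way to observe this in CPython is to compare object identity: the trivial slice b[:] of an immutable bytes object comes back as the very same object, while any other slice produces a new one. (A minimal check; the identity shortcut for b[:] is a CPython implementation detail, not a language guarantee.)

>>> b = b'hello world'
>>> b[:] is b          # trivial slice: CPython returns the same object
True
>>> b[1:] is b         # any other slice: a brand-new bytes object
False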
Yes, slicing a bytes object does create a copy, at least as of CPython 3.9.12. The closest the documentation comes to admitting this is in the description of the bytes constructor:
In addition to the literal forms, bytes objects can be created in a number of other ways:
- A zero-filled bytes object of a specified length:
bytes(10)
- From an iterable of integers:
bytes(range(20))
- Copying existing binary data via the buffer protocol:
bytes(obj)
which suggests any creation of a bytes object creates a separate copy of the data. But since I had a hard time finding an explicit confirmation that slicing does the same, I resorted to an empirical test.
>>> b = b'\1' * 100_000_000
>>> qq = [b[1:] for _ in range(20)]
After executing the first line, memory usage of the python3 process in top was about 100 MB. The second line executed after a considerable delay, and memory usage rose to about 2 GB. This seems pretty conclusive. PyPy 7.3.9 targeting Python 3.8 behaves largely the same, though of course PyPy's garbage collection is not as eager as CPython's, so the memory is not freed as soon as the bytes objects become unreachable.
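If you prefer a measurement that does not rely on top, the standard-library tracemalloc module can show the allocation caused by a single slice. A minimal sketch, assuming tracemalloc is started before the objects are created (exact byte counts vary slightly by Python version):

import tracemalloc

tracemalloc.start()

b = b'\1' * 100_000_000
before, _ = tracemalloc.get_traced_memory()

s = b[1:]                          # slicing the bytes object
after, _ = tracemalloc.get_traced_memory()
print(after - before)              # roughly 100 MB: the slice owns its own buffer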
To avoid copying the underlying buffer, wrap your bytes in a memoryview and slice that:
>>> bm = memoryview(b)
>>> qq = [bm[1:] for _ in range(50)]
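Applied to the loop from the question, that means building the memoryview once and slicing the view instead of the bytes object; each mv[start_pos:] is a cheap view object that shares the original buffer. This sketch assumes your decode_bytes accepts any bytes-like object (if it insists on a real bytes instance, it would have to copy anyway):

from pathlib import Path

data = Path("binary-file.dat").read_bytes()
mv = memoryview(data)              # zero-copy view over the bytes object
total_length = len(mv)
start_pos = 0
while start_pos < total_length:
    bytes_processed = decode_bytes(mv[start_pos:])   # no copy of the buffer
    start_pos += bytes_processed

Note that every slice of the view keeps the original buffer alive, so the full 100 MB stays in memory as long as any view on it exists.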