
Does slicing a bytes object create a whole new copy of the data in Python?

Say I have a very large bytes object (from loading a binary file) and I want to read it part by part, advancing the starting position until it reaches the end. I use slicing to accomplish this. I'm worried that Python will create a completely new copy each time I ask for a slice, instead of simply giving me the address of the memory at the position I want.

Simple example:

from pathlib import Path

data = Path("binary-file.dat").read_bytes()
total_length = len(data)
start_pos = 0

while start_pos < total_length:
    # decode_bytes returns the number of bytes it consumed
    bytes_processed = decode_bytes(data[start_pos:])  # <---- ***
    start_pos += bytes_processed

In the above example, does Python create a completely new copy of the bytes object starting from start_pos because of the slicing? If so, what is the best way to avoid the copy and pass just a pointer to the relevant position in the bytes array?

Asked by Tekz, Jun 23 '20 10:06



1 Answer

Yes, slicing a bytes object does create a copy, at least as of CPython 3.9.12. The closest the documentation comes to admitting this is in the description of the bytes constructor:

In addition to the literal forms, bytes objects can be created in a number of other ways:

  • A zero-filled bytes object of a specified length: bytes(10)
  • From an iterable of integers: bytes(range(20))
  • Copying existing binary data via the buffer protocol: bytes(obj)

which suggests any creation of a bytes object creates a separate copy of the data. But since I had a hard time finding an explicit confirmation that slicing does the same, I resorted to an empirical test.

>>> b = b'\1' * 100_000_000
>>> qq = [b[1:] for _ in range(20)]

After executing the first line, memory usage of the python3 process in top was about 100 MB. The second line executed after a considerable delay and raised memory usage to about 2 GB. This seems pretty conclusive. PyPy 7.3.9 targeting Python 3.8 behaves largely the same, though PyPy’s garbage collection is not as eager as CPython’s, so the memory is not freed as soon as the bytes objects become unreachable.
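If watching top is inconvenient, the same observation can probably be reproduced from inside the interpreter. The following is only a sketch, assuming CPython and its standard tracemalloc module; the size and the slice count are arbitrary choices of mine, much smaller than in the test above, so that it runs quickly.

```python
import tracemalloc

SIZE = 1_000_000  # far smaller than the 100 MB above so the check runs quickly
b = b"\1" * SIZE  # allocated before tracing starts, so it is not counted

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
slices = [b[1:] for _ in range(5)]  # keep references so nothing is freed
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Memory that was allocated while building the slices and is still held.
new_bytes = after - before
print(f"{new_bytes:,} bytes allocated for 5 slices")
```

If slicing copies, the reported figure should be at least five times the slice length, which is what I would expect on CPython.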

To avoid copying the underlying buffer, wrap your bytes in a memoryview and slice that:

>>> bm = memoryview(b)
>>> qq = [bm[1:] for _ in range(50)]
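Applied to the loop in the question, this could look roughly like the sketch below. The question does not show decode_bytes, so the version here is a made-up stand-in (one length byte followed by that many payload bytes), and the sample data replaces the file read. Both are assumptions for illustration only.

```python
records = []

def decode_bytes(buf):
    # Hypothetical record format (an assumption for this sketch):
    # one length byte followed by that many payload bytes.
    length = buf[0]
    records.append(bytes(buf[1:1 + length]))  # copies only this small payload
    return 1 + length  # number of bytes consumed

data = b"\x03abc\x02hi\x00\x01z"  # stand-in for Path(...).read_bytes()
view = memoryview(data)           # zero-copy window over data
start_pos = 0

while start_pos < len(view):
    # Slicing the memoryview creates a small view object, not a copy of the tail.
    start_pos += decode_bytes(view[start_pos:])

print(records)
```

Note that decode_bytes then receives a memoryview rather than a bytes object, so it has to work with any bytes-like object. Indexing and slicing behave as shown, but bytes-only methods such as split are not available on a memoryview.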
Answered by user3840170, Oct 31 '22 20:10