I am writing a parser for a protocol working over TCP.
Some messages are split between multiple packets so I need to be able to "peek" into my stream with a possibility of going back and also to append incoming data at the end. On the other hand, I would like to be able to discard the content of the packets I have successfully parsed.
bytes
is that appending requires copying (not in CPython, but then it is also impossible to delete the first bytes in an immutable object).bytearray
is that flushing the already seen bytes also requires copying (or so I thought, see below)collections.deque
is the huge memory requirement. Same with list
.However, I did some tests with bytearray
and it seems the pop(0) operation is far more efficient than with lists:
from time import time
n = 100000
for container in [bytearray, list]:
print(container)
a = container(b'a'*n)
t = time()
for i in range(n):
del a[0]
print('del a[0]', time() - t)
a = container(b'a'*n)
t = time()
for i in range(n):
del a[-1]
print('del a[-1]', time() - t)
a = container(b'a'*n)
t = time()
for i in range(n-1):
del a[1]
print('del a[1]', time() - t)
a = container(b'a'*n)
t = time()
for i in range(n-1):
del a[-2]
print('del a[-2]', time() - t)
print()
It seems that del a[0]
and del a[-1]
have about the same complexity for bytearray
, in cpython2, cpython3 and pypy3.
I would like to know:
How is that possible? Is there a more efficient way than del a[:k]
to delete the first k
bytes?
Is there a more efficient data structure than the bytearray
? (maybe using array
, memoryview
or ctypes
)
The Python bytearray() function converts strings or collections of integers into a mutable sequence of bytes. It provides developers the usual methods Python affords to both mutable and byte data types.
The difference between bytes() and bytearray() is that bytes() returns an object that cannot be modified, and bytearray() returns an object that can be modified.
The bytearray type is a mutable sequence of integers in the range between 0 and 255. It allows you to work directly with binary data. It can be used to work with low-level data such as that inside of images or arriving directly from the network. Bytearray type inherits methods from both list and str types.
The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. The reason you can create it with a string as each character is converted to its ASCII integer value. So when assigning 'H' you actually mean to assign 72 . If you want to be able to assign chars, then just pass each one into ord() first.
Python deliberately sacrifices code performance for programmer's performance.
Use whatever is the most convenient to use.
When you have a correctly working implementation and the performance proves to be inadequate, replace critical bits only (as shown by profiling) with faster equivalents. See https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Overview:_Optimize_what_needs_optimizing for more info.
That said, a prime candidate for the use case you described would be a "chunked buffer" that would return slices transparently from a series of buffers.
Extracting data from it will still require copying (since all standard Python types own their memory), and you'll have interpreter overhead if you implement the type in pure Python. So to get any significant improvement, you're likely to have to go into Cython/C or something. That's why it's so important to get the general design right first -- in pure Python, it's much easier to change things.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With