What exactly is the point of memoryview in Python

Tags:

python

buffer

memoryview

Checking the documentation on memoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

class memoryview(obj)

Create a memoryview that references obj. obj must support the buffer protocol. Built-in objects that support the buffer protocol include bytes and bytearray.

Then we are given the sample code:

    >>> v = memoryview(b'abcefg')
    >>> v[1]
    98
    >>> v[-1]
    103
    >>> v[1:4]
    <memory at 0x7f3ddc9f4350>
    >>> bytes(v[1:4])
    b'bce'

Quotation over; now let's take a closer look:

    >>> b = b'long bytes stream'
    >>> b.startswith(b'long')
    True
    >>> v = memoryview(b)
    >>> vsub = v[5:]
    >>> vsub.startswith(b'bytes')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'memoryview' object has no attribute 'startswith'
    >>> bytes(vsub).startswith(b'bytes')
    True
    >>>

So what I gather from the above:

We create a memoryview object to expose the internal data of a buffer object without copying. However, in order to do anything useful with it (by calling the methods the underlying object provides), we have to create a copy!

Usually memoryview (or the old buffer object) would be needed when we have a large object, and the slices can be large too. Efficiency matters most when we are making large slices, or making small slices a large number of times.

With the above scheme, I don't see how it can be useful for either situation, unless someone can explain to me what I'm missing here.
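For reference, a few operations do work on a view without copying the whole remainder. The sketch below is illustrative only and assumes CPython 3: it compares a small slice of the view with a bytes literal, and when a bytes method really is needed it copies just the five bytes involved rather than the entire tail.

```python
b = b'long bytes stream'
v = memoryview(b)
vsub = v[5:]                      # no copy: a view onto b starting at offset 5

# Comparing a view with a bytes object compares the contents.
prefix_matches = vsub[:5] == b'bytes'

# When a bytes method is needed, copy only the small slice, not the whole tail.
small_copy = bytes(vsub[:5])
starts = small_copy.startswith(b'bytes')

print(prefix_matches, starts, len(vsub))
```

This does not give the view the full bytes API, but it keeps each copy proportional to the piece being examined rather than to the remaining buffer.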

Edit1:

We have a large chunk of data that we want to process by advancing through it from start to end, for example extracting tokens from the start of a string buffer until the buffer is consumed. In C terms, this is advancing a pointer through the buffer, and the pointer can be passed to any function expecting the buffer type. How can something similar be done in Python?

People suggest workarounds; for example, many string and regex functions take position arguments that can be used to emulate advancing a pointer. There are two issues with this. First, it's a workaround: you are forced to change your coding style to overcome the shortcomings. Second, not all functions have position arguments: regex functions and startswith do, but encode()/decode() don't.
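To make the position-argument workaround concrete, here is a minimal sketch. The sample data and the `\w+` token pattern are assumptions chosen for illustration. Note that for regexes it is the methods of a compiled pattern, such as `Pattern.match`, that take a `pos` argument.

```python
import re

data = b'long bytes stream'
word = re.compile(rb'\w+')       # illustrative token pattern

# startswith accepts an optional start offset, so no slice is created.
at_five = data.startswith(b'bytes', 5)

# Compiled patterns accept a pos argument; matching begins at that offset.
m = word.match(data, 5)
token = m.group()                # the matched token
next_pos = m.end()               # offset just past the token, relative to data

print(at_five, token, next_pos)
```

The offset has to be threaded through every call by hand, which is exactly the change in coding style described above.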

Others might suggest loading the data in chunks, or processing the buffer in small segments larger than the max token. Okay, so we are aware of these possible workarounds, but we are supposed to work in a more natural way in Python without trying to bend the coding style to fit the language, aren't we?

Edit2:

A code sample would make things clearer. This is what I want to do, and what I assumed memoryview would allow me to do at first glance. Let's use pmview (proper memory view) for the functionality I'm looking for:

    tokens = []
    xlarge_str = get_string()
    xlarge_str_view = pmview(xlarge_str)

    while True:
        token = get_token(xlarge_str_view)
        if token:
            # vslice: view slice; the stop parameter defaults to the end of the buffer
            xlarge_str_view = xlarge_str_view.vslice(len(token))
            tokens.append(token)
        else:
            break
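A rough approximation of the loop above is possible with the real memoryview, under some assumptions: the data is bytes rather than str, and tokens can be found with a regular expression (the `re` module accepts bytes-like objects, including views). The `TOKEN` pattern and the `get_token`/`tokenize` helpers below are hypothetical stand-ins, not part of any library.

```python
import re

TOKEN = re.compile(rb'\s*(\w+)')   # hypothetical token rule: a run of word characters

def get_token(view):
    """Return (token, consumed) for the token at the start of view, or (None, 0)."""
    m = TOKEN.match(view)          # re accepts bytes-like objects, including memoryview
    if m is None:
        return None, 0
    return m.group(1), m.end()     # group() copies only the token's bytes

def tokenize(data):
    tokens = []
    view = memoryview(data)
    while True:
        token, consumed = get_token(view)
        if token is None:
            break
        tokens.append(token)
        view = view[consumed:]     # advance the "pointer"; no copy of the remainder
    return tokens

print(tokenize(b'long bytes stream'))
```

Slicing the view plays the role of vslice here. This only works with functions that accept the buffer protocol, so it is not a general replacement for the pmview imagined above.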

673

asked Sep 06 '13 10:09

Basel Shishani

1 Answer

One reason memoryviews are useful is that they can be sliced without copying the underlying data, unlike bytes/str.

Consider the following toy example.

    import time

    # Repeatedly drop the first byte from a bytes object (each slice copies the rest).
    for n in (100000, 200000, 300000, 400000):
        data = b'x'*n
        start = time.time()
        b = data
        while b:
            b = b[1:]
        print(f'     bytes {n} {time.time() - start:0.3f}')

    # Same loop on a memoryview (each slice is a new view, not a copy).
    for n in (100000, 200000, 300000, 400000):
        data = b'x'*n
        start = time.time()
        b = memoryview(data)
        while b:
            b = b[1:]
        print(f'memoryview {n} {time.time() - start:0.3f}')

On my computer, I get

         bytes 100000 0.211
         bytes 200000 0.826
         bytes 300000 1.953
         bytes 400000 3.514
    memoryview 100000 0.021
    memoryview 200000 0.052
    memoryview 300000 0.043
    memoryview 400000 0.077

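You can clearly see the quadratic complexity of the repeated bytes slicing: each b[1:] copies all the remaining data, so doubling n from 100000 to 200000 roughly quadruples the running time (0.211 s to 0.826 s). Even with only 400000 iterations, it's already unmanageable. Meanwhile, slicing a memoryview does not copy the data, so the memoryview version has linear complexity and is lightning fast.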
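That a memoryview slice shares memory with the original object can also be seen directly, without timing anything. A minimal sketch, assuming a writable bytearray as the underlying buffer (the sample contents are arbitrary):

```python
buf = bytearray(b'hello world')
tail = memoryview(buf)[6:]   # a view of the last five bytes; nothing is copied
copied = bytes(buf[6:])      # an independent copy of the same five bytes

tail[0] = ord('W')           # write through the view

print(buf)                   # the original buffer sees the change
print(copied)                # the copy does not
```

The write through the slice shows up in buf because the view and the buffer are the same memory, while the bytes copy keeps its old contents.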

Edit: Note that this was done in CPython. There was a bug in PyPy up to 4.0.1 that caused memoryviews to have quadratic performance.

162

answered Sep 18 '22 01:09

Antimony
