Slicing a file in Python

Tags: python, file, bytearray, buffer

I have recently been working on a script that takes a file, splits it into chunks, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read the file one byte at a time. I do not need random access: I just read linearly from beginning to end, select certain positions as I go, and yield the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory-mapped file wrapped in a bytearray. Instead of yielding the chunk itself, I yield its offset and size, leaving the outer function to slice it.
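A minimal sketch of this offset-and-size approach, under stated assumptions: the boundary rule is passed in as a hypothetical `is_boundary` predicate (the real content-dependent rule is not shown in the question), the names `iter_chunk_bounds` and `iter_chunks` are illustrative, and the mmap is indexed directly rather than wrapped in a bytearray.

```python
import mmap


def iter_chunk_bounds(buf, is_boundary):
    """Scan buf linearly and yield (offset, size) for each chunk.

    is_boundary is a hypothetical predicate taking one byte value (an int)
    and returning True when a chunk should end after that byte.
    """
    start = 0
    for pos in range(len(buf)):
        if is_boundary(buf[pos]):  # indexing bytes or an mmap gives an int
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(buf):
        yield start, len(buf) - start  # trailing chunk with no closing boundary


def iter_chunks(path, is_boundary):
    """Outer function: map the file read-only and slice out each chunk."""
    # Note: mmap raises ValueError for an empty file, so callers
    # would need to handle that case separately.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset, size in iter_chunk_bounds(mm, is_boundary):
            yield mm[offset:offset + size]  # slicing an mmap returns bytes
```

For example, `list(iter_chunk_bounds(b"ab\ncd\ne", lambda b: b == 10))` cuts after every newline byte. Only the slice in the outer function copies data, one chunk at a time.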

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

1. Is bytearray copying the data?
2. I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read-only container that I should use instead?
3. Because I do not accumulate in a buffer, I am randomly accessing the bytearray (and therefore the underlying file). Even though access might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?


asked Nov 01 '14 15:11

Hernan

1 Answer

Converting an object to a mutable one does incur a data copy: bytearray(obj) allocates a new buffer and copies the source into it. You can instead read the file directly into a bytearray:
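```
import os

data = bytearray(os.path.getsize(FILENAME))  # preallocate to the file size
with open(FILENAME, 'rb') as f:
    f.readinto(data)  # fill the buffer in place, no intermediate bytes object
```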

Source: http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12
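If a read-only view without a copy is what is wanted, a memoryview over the mmap is one option to consider. The following is only a sketch under stated assumptions: `demo_views` is a hypothetical helper, and it writes its input to a throwaway temporary file just so the example is self-contained.

```python
import mmap
import os
import tempfile


def demo_views(data):
    """Compare a bytearray copy with a memoryview over a read-only mmap.

    Hypothetical helper: writes data (non-empty bytes) to a temporary file,
    maps it, and returns (view is read-only, bytes 2..4, length of the copy).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(data)
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            copy = bytearray(mm)          # new buffer, whole mapping copied
            with memoryview(mm) as view:  # shares the mapping, no copy
                part = view[2:5]          # slicing a memoryview does not copy
                result = (view.readonly, part.tobytes(), len(copy))
                part.release()            # release views before the mmap closes
        return result
```

Calling `demo_views(b"0123456789")` returns `(True, b"234", 10)`: the view reports itself as read-only because the mapping was opened with ACCESS_READ, and data is only copied when `tobytes()` is called on the slice.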

There is a string-to-bytearray conversion, so there is a potential performance issue.
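bytearray is an array of bytes, so its size is capped by PY_SSIZE_T_MAX (exposed in Python as sys.maxsize) as well as by available memory. The often-quoted limit of PY_SSIZE_T_MAX/sizeof(PyObject*) applies to lists, which store one pointer per element. For more info, see the question "How Big can a Python Array Get?"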


answered Oct 19 '22 03:10

snowblade
