Is there a Python module for transparently working with a file's contents as a buffer?

Tags: python

I'm working on a pure Python file parser for event logs, which may range in size from kilobytes to gigabytes. Is there a module that abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:

with FileBackedBuffer('/my/favorite/path', 'rb') as buf:
    header = buf[0:0x10]
    footer = buf[0x10000000:]

The mmap module may fulfill my requirements; however, I have two reservations that I'd appreciate feedback on:

1. It is important that the module handle files larger than available RAM/swap. I am unsure whether mmap can do this well.
2. The mmap constructors differ depending on the OS. This makes me hesitant, as I am looking to write nicely cross-platform code and would rather not muck about in OS specifics. I will if I need to, but this set off a warning that I might be looking in the wrong place.

If mmap is the correct module for such a task, how does it handle these two points? If it is not, what is an appropriate module?

asked Dec 24 '12 05:12

Willi Ballenthin

1 Answer

mmap can easily handle files larger than RAM/swap. What mmap can't do is map a file larger than the address space in one go, which means that 32-bit systems are limited in how large a file they can map at once.
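
If the address-space limit is a concern, one possible workaround is to map only a small window of the file at a time. The sketch below is an illustration, not part of any library: `read_window` is a hypothetical helper name, and it assumes read-only access is enough. It relies on the `offset` argument of `mmap.mmap`, which must be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import os


def read_window(path, start, size):
    """Return up to `size` bytes at `start`, mapping only an aligned window.

    Hypothetical helper for illustration; assumes `start` is non-negative.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    # The mapping offset must be a multiple of the allocation granularity,
    # so round `start` down and remember how far into the window it sits.
    aligned = (start // granularity) * granularity
    delta = start - aligned
    with open(path, "rb") as f:
        file_size = os.fstat(f.fileno()).st_size
        # Clamp the request so the mapping never extends past end of file.
        size = min(size, file_size - start)
        if size <= 0:
            return b""
        with mmap.mmap(f.fileno(), delta + size,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            return window[delta:delta + size]
```

Only `delta + size` bytes of address space are used per call, regardless of how large the file is.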

What happens with mmap is that the OS keeps only as much of the data in memory as it chooses to, but your program will think it is all there. Be careful with usage patterns, though: if your data doesn't fit in RAM and you jump around too randomly, it will page heavily (discarding pages from your file that you haven't used recently to make room for the new pages being loaded).
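
As a rough illustration of a paging-friendly access pattern, the sketch below walks a mapped file in ascending offset order, so each page is touched once and can be dropped afterwards. `iter_chunks` and its `chunk_size` default are assumptions made for this example, and it assumes the file is non-empty (mapping an empty file raises `ValueError`).

```python
import mmap


def iter_chunks(path, chunk_size=1 << 20):
    """Yield consecutive chunks of a mapped file in ascending offset order.

    Hypothetical helper for illustration; assumes a non-empty file.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ works on Windows and Unix.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            for start in range(0, len(buf), chunk_size):
                # Slicing copies this chunk out of the mapping as bytes.
                yield buf[start:start + chunk_size]
```

A parser that processes records in file order can consume these chunks one after another instead of seeking back and forth.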

If you don't need to specify anything beyond fileno and length, I don't believe you need to worry about the platform-specific arguments for mmap. If you do need the extra arguments, then you will either have to master Windows versus Unix, or pass that on to your users. I don't know what your library will be, but it may be nice to provide reasonable defaults on both platforms while also allowing the user to tweak the options.

It looks to me unlikely that you would care about the Windows tagname option. Also, if you are cross-platform, just accept the Unix default for prot, since you have no choice on Windows. That leaves only MAP_PRIVATE versus MAP_SHARED. The default is MAP_SHARED; I'm not sure whether that is the option that most closely matches Windows behavior, but accepting the default is probably fine there.
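
To make this concrete, here is a minimal sketch of the kind of wrapper the question describes, passing only fileno, length, and the `access` keyword, which `mmap.mmap` accepts on both Windows and Unix. The class name `FileBackedBuffer` is borrowed from the question and is not an existing module; the sketch assumes read-only access and a non-empty file.

```python
import mmap


class FileBackedBuffer:
    """Hypothetical read-only, slice-able view of a file backed by mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        try:
            # No tagname, prot, or flags: only arguments common to both platforms.
            self._map = mmap.mmap(self._file.fileno(), 0,
                                  access=mmap.ACCESS_READ)
        except Exception:
            self._file.close()
            raise

    def __len__(self):
        return len(self._map)

    def __getitem__(self, key):
        # Slices return bytes; a single index returns an int, as with bytes.
        return self._map[key]

    def close(self):
        self._map.close()
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False
```

With this in place, `with FileBackedBuffer(path) as buf: header = buf[0:0x10]` behaves much like the snippet in the question, minus the mode argument.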

answered Oct 23 '22 17:10

Joshua D. Boyd
