I have a custom sequence-like object, s
, that inherits collections.Sequence
and implements custom __len__
and __getitem__
. It represents a big blob of strings (>4GB) and is lazily loaded (I can't afford loading all into memory).
I'd like to do RE match on it, re.compile('some-pattern').match(s)
, but it fails with TypeError: expected string or buffer
.
In practice, pattern is not something like '.*'
that requires the entire s
to be loaded; it usually takes the first few tens of bytes to match; however, I can't tell beforehand the exact number of bytes and I want keep it general, therefore I don't want to do something like re.compile('some-pattern').match(s[:1000])
.
Any suggestions on how to create a str
-like object that is accepted by re
?
The following code illustrates my unsuccessful attempts. Inheriting from str
is not working either.
In [1]: import re, collections
In [2]: class MyStr(collections.Sequence):
def __len__(self): return len('hello')
def __getitem__(self, item): return 'hello'[item]
...:
In [3]: print(re.compile('h.*o').match(MyStr()))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-df08913b19d7> in <module>()
----> 1 print(re.compile('h.*o').match(MyStr()))
TypeError: expected string or buffer
If the big blob of string comes from a single big file then I can use mmap
and it should work. However, my case is more complicated. I have multiple big files, I mmap
ed each of them and have a custom class that is a concatenated view of them. I actually want to perform the RE match starting from any given position in the view. I omit such details in the original question, but I think it might be helpful to someone who wants to understand why I have such weird requirement.
Python has a module named re to work with RegEx. Here's an example: import re pattern = '^a...s$' test_string = 'abyss' result = re. match(pattern, test_string) if result: print("Search successful.") else: print("Search unsuccessful.")
The Python "re" module provides regular expression support. In Python a regular expression search is typically written as: match = re. search(pat, str) The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string.
There is no special method you can implement that'll let re.match()
accept your custom class and not require that you read all data into memory.
That's because there currently is no special method that'll let your custom class act as a buffer-protocol object. re
methods only accept str
strings (which do implement the buffer protocol), and unicode
strings (and subclasses, data accessed directly, not via __unicode__
). The re
methods do not accept arbitrary sequences, and only the buffer protocol would let you avoid reading the whole thing into memory in one go.
Rather than try to implement a custom object, however, if your data is stored entirely in a single on-disk file (but is too large to read into memory), you want to use memory mapping. Memory mapping uses the virtual memory facilities of your OS to access portions of a file as sections of memory.
The virtual memory subsystem lets your OS manage more memory than your computer has physically available in the form of RAM, by putting chunks of memory ('pages') on to your harddisk instead. As memory is accessed, the OS keeps swapping out pages from disk to physical memory and back again. Memory mapping simply expands this functionality to existing files, making it possible to treat a very large file as a single, large string where the OS will ensure that parts that you try to access are available in memory when needed.
In Python, this functionality is available via the mmap
module, and a memory mapped file is implements the buffer protocol. You can pass such objects directly to re.match()
, and Python and your OS will work together to search the data in the file for a match.
So, given a large file filename = '/path/to/largefile'
and regular expression pattern
, this would search the file for a match at the start for your pattern:
import re
import mmap
import os
fd = os.open(filename, os.O_RDONLY)
mapped = mmap.mmap(fd, 0)
matched = re.match(pattern, mapped)
If you have multiple files, you need to find a way to concatenate them. Virtually, or physically. If you are using Linux, you can concatenate files virtually by using a network block device, or you can use a FUSE virtual file system. See A virtual file containing the concatenation of other files.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With