
Python re for custom sequence type

I have a custom sequence-like object, s, that inherits from collections.Sequence and implements custom __len__ and __getitem__. It represents a big blob of strings (>4GB) and is lazily loaded (I can't afford to load it all into memory).

I'd like to do an RE match on it, re.compile('some-pattern').match(s), but it fails with TypeError: expected string or buffer.

In practice, the pattern is not something like '.*' that requires the entire s to be loaded; it usually needs only the first few tens of bytes to match. However, I can't tell the exact number of bytes beforehand and I want to keep it general, so I don't want to do something like re.compile('some-pattern').match(s[:1000]).

Any suggestions on how to create a str-like object that is accepted by re?

The following code illustrates my unsuccessful attempts. Inheriting from str is not working either.

In [1]: import re, collections

In [2]: class MyStr(collections.Sequence):
   ...:     def __len__(self): return len('hello')
   ...:     def __getitem__(self, item): return 'hello'[item]
   ...:

In [3]: print(re.compile('h.*o').match(MyStr()))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-df08913b19d7> in <module>()
----> 1 print(re.compile('h.*o').match(MyStr()))

TypeError: expected string or buffer

If the big blob of strings came from a single big file, I could use mmap and it should work. However, my case is more complicated. I have multiple big files; I mmap each of them and have a custom class that is a concatenated view of them. I actually want to perform the RE match starting from any given position in the view. I omitted these details from the original question, but they might help someone who wants to understand why I have such a weird requirement.

Kan Li asked Apr 07 '19 19:04

1 Answer

There is no special method you can implement that'll let re.match() accept your custom class and not require that you read all data into memory.

That's because there is currently no special method that lets a custom Python class act as a buffer-protocol object. The re methods accept only objects that implement the buffer protocol (such as str byte strings) and unicode strings (including subclasses, whose data is accessed directly rather than via __unicode__). They do not accept arbitrary sequences, and only the buffer protocol would let you avoid reading the whole thing into memory in one go.
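The difference can be checked directly. The following is a minimal sketch, assuming Python 3 (where the ABC lives in collections.abc and patterns for buffer-backed data must be bytes patterns); MyStr mirrors the class from the question, and accepts() is a hypothetical helper introduced only for this illustration.

```python
import re
from collections.abc import Sequence


class MyStr(Sequence):
    """Pure-Python sequence, as in the question; it exposes no buffer."""

    def __len__(self):
        return len(b'hello')

    def __getitem__(self, item):
        return b'hello'[item]


def accepts(pattern, subject):
    """Hypothetical helper: True if re can match against subject."""
    try:
        pattern.match(subject)
    except TypeError:
        return False
    return True


pattern = re.compile(b'h.*o')
custom_ok = accepts(pattern, MyStr())              # plain sequence: rejected
buffer_ok = accepts(pattern, bytearray(b'hello'))  # buffer protocol: accepted
print(custom_ok, buffer_ok)
```

Implementing __len__ and __getitem__ is enough to satisfy the Sequence ABC, but re never calls them; it asks the object for its underlying memory instead.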

Rather than try to implement a custom object, however, if your data is stored entirely in a single on-disk file (but is too large to read into memory), you want to use memory mapping. Memory mapping uses the virtual memory facilities of your OS to access portions of a file as sections of memory.

The virtual memory subsystem lets your OS manage more memory than your computer has physically available as RAM, by putting chunks of memory ('pages') on your hard disk instead. As memory is accessed, the OS keeps swapping pages between disk and physical memory. Memory mapping simply extends this functionality to existing files, making it possible to treat a very large file as a single, large string, where the OS ensures that the parts you try to access are available in memory when needed.

In Python, this functionality is available via the mmap module, and a memory-mapped file implements the buffer protocol. You can pass such objects directly to re.match(), and Python and your OS will work together to search the data in the file for a match.

So, given a large file filename = '/path/to/largefile' and a regular expression pattern, this would test whether your pattern matches at the start of the file:

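import re
import mmap
import os

# Open the file read-only; a length of 0 maps the whole file.
fd = os.open(filename, os.O_RDONLY)
# Request a read-only mapping: the default mapping is read-write,
# which typically fails for a descriptor opened with O_RDONLY.
mapped = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
# On Python 3, pattern must be a bytes pattern (e.g. b'h.*o').
matched = re.match(pattern, mapped)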
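Since the question also asks about matching from an arbitrary position, here is a minimal sketch of that, assuming Python 3. It relies on the optional pos argument of a compiled pattern's match() method; the temporary file and its contents are stand-ins invented for the example, not part of the original setup.

```python
import mmap
import os
import re
import tempfile

# Create a small stand-in for the large file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'xxxxhello world')
    filename = tmp.name

pattern = re.compile(b'h.*o')

with open(filename, 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        at_start = pattern.match(mapped)      # anchored at offset 0
        at_offset = pattern.match(mapped, 4)  # anchored at offset 4
        # Copy the results out before the mapping is closed.
        span = at_offset.span() if at_offset else None
        text = at_offset.group() if at_offset else None
    finally:
        mapped.close()

os.remove(filename)
print(at_start, span, text)
```

Passing pos avoids slicing the mapping, which would copy the sliced bytes into memory; the match is simply attempted at that offset of the mapped file.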

If you have multiple files, you need to find a way to concatenate them, either virtually or physically. If you are using Linux, you can concatenate files virtually by using a network block device, or you can use a FUSE virtual file system. See A virtual file containing the concatenation of other files.
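For the physical option, a rough sketch (assuming Python 3 and enough free disk space for a second copy of the data) is to stream the files into one combined file and memory-map that. The concatenate() helper, the file names and the sample contents below are all hypothetical.

```python
import mmap
import os
import re
import shutil
import tempfile


def concatenate(paths, destination):
    """Copy each file in paths, in order, onto the end of destination."""
    with open(destination, 'wb') as out:
        for path in paths:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, out)  # streams in chunks


# Hypothetical stand-ins for the large input files.
workdir = tempfile.mkdtemp()
parts = []
for index, data in enumerate([b'abc hel', b'lo world']):
    path = os.path.join(workdir, 'part%d' % index)
    with open(path, 'wb') as f:
        f.write(data)
    parts.append(path)

combined = os.path.join(workdir, 'combined')
concatenate(parts, combined)

with open(combined, 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Match at offset 4 of the combined view; the match spans the
        # boundary between the two original files.
        match = re.compile(b'hello').match(mapped, 4)
        found = match.group() if match else None
    finally:
        mapped.close()

shutil.rmtree(workdir)
print(found)
```

The trade-off is the extra disk space and copy time; the virtual approaches avoid both but depend on OS-level tooling.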

Martijn Pieters answered Sep 18 '22 11:09