Python re for custom sequence type

Tags:

I have a custom sequence-like object, s, that inherits collections.Sequence and implements custom __len__ and __getitem__. It represents a big blob of strings (>4GB) and is lazily loaded (I can't afford loading all into memory).

I'd like to do RE match on it, re.compile('some-pattern').match(s), but it fails with TypeError: expected string or buffer.

In practice, pattern is not something like '.*' that requires the entire s to be loaded; it usually takes the first few tens of bytes to match; however, I can't tell beforehand the exact number of bytes and I want keep it general, therefore I don't want to do something like re.compile('some-pattern').match(s[:1000]).

Any suggestions on how to create a str-like object that is accepted by re?

The following code illustrates my unsuccessful attempts. Inheriting from str is not working either.

In [1]: import re, collections

In [2]: class MyStr(collections.Sequence):
    def __len__(self): return len('hello')
    def __getitem__(self, item): return 'hello'[item]
   ...:

In [3]: print(re.compile('h.*o').match(MyStr()))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-df08913b19d7> in <module>()
----> 1 print(re.compile('h.*o').match(MyStr()))

TypeError: expected string or buffer

If the big blob of string comes from a single big file then I can use mmap and it should work. However, my case is more complicated. I have multiple big files, I mmaped each of them and have a custom class that is a concatenated view of them. I actually want to perform the RE match starting from any given position in the view. I omit such details in the original question, but I think it might be helpful to someone who wants to understand why I have such weird requirement.

987

asked Apr 07 '19 19:04

Kan Li

1 Answers

There is no special method you can implement that'll let re.match() accept your custom class and not require that you read all data into memory.

That's because there currently is no special method that'll let your custom class act as a buffer-protocol object. re methods only accept str strings (which do implement the buffer protocol), and unicode strings (and subclasses, data accessed directly, not via __unicode__). The re methods do not accept arbitrary sequences, and only the buffer protocol would let you avoid reading the whole thing into memory in one go.

Rather than try to implement a custom object, however, if your data is stored entirely in a single on-disk file (but is too large to read into memory), you want to use memory mapping. Memory mapping uses the virtual memory facilities of your OS to access portions of a file as sections of memory.

The virtual memory subsystem lets your OS manage more memory than your computer has physically available in the form of RAM, by putting chunks of memory ('pages') on to your harddisk instead. As memory is accessed, the OS keeps swapping out pages from disk to physical memory and back again. Memory mapping simply expands this functionality to existing files, making it possible to treat a very large file as a single, large string where the OS will ensure that parts that you try to access are available in memory when needed.

In Python, this functionality is available via the mmap module, and a memory mapped file is implements the buffer protocol. You can pass such objects directly to re.match(), and Python and your OS will work together to search the data in the file for a match.

So, given a large file filename = '/path/to/largefile' and regular expression pattern, this would search the file for a match at the start for your pattern:

import re
import mmap
import os

fd = os.open(filename, os.O_RDONLY)
mapped = mmap.mmap(fd, 0)
matched = re.match(pattern, mapped)

If you have multiple files, you need to find a way to concatenate them. Virtually, or physically. If you are using Linux, you can concatenate files virtually by using a network block device, or you can use a FUSE virtual file system. See A virtual file containing the concatenation of other files.

169

answered Sep 18 '22 11:09

Martijn Pieters

Related questions
                            
                                Python Profiling: What does "method 'poll' of 'select.poll' objects"?
                            
                                TensorFlow freeze_graph.py: The name 'save/Const:0' refers to a Tensor which does not exist
                            
                                Binning of data along one axis in numpy
                            
                                Selenium chromedriver 2.25 TimeoutException cannot determine loading status
                            
                                How to query an advanced search with google customsearch API?
                            
                                "pip install jq" generates errors on Mac and Windows
                            
                                Python3 does not find modules installed by pip3
                            
                                Python parallel execution with selenium
                            
                                How to write pyspark dataframe to HDFS and then how to read it back into dataframe?
                            
                                Understanding LSTM model using tensorflow for sentiment analysis
                            
                                numpy astype from float32 to float16
                            
                                How to handle large amouts of data in tensorflow?
                            
                                django aws s3 image resize on upload and access to various resized image
                            
                                What is the difference between @jit and @vectorize in numba?
                            
                                Does asyncio from python support coroutine-based API for UDP networking?
                            
                                How does Python determine if two strings are identical
                            
                                How do you add folding to QsciLexerCustom subclass?
                            
                                Could not install packages due to an EnvironmentError: [Errno 2]
                            
                                Bash Operator error: No such file or directory in airflow
                            
                                How to use a prototbuf map in Python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python re for custom sequence type

Tags:

python

regex

python-2.7

Kan Li

People also ask

1 Answers

Martijn Pieters

Recent Activity

Donate For Us