How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't seem to help, because their content can't be converted to some kind of lazy string. The re module only supports a string as the content argument.
#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something usefull");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}
To demonstrate my requirements, I wrote the short sample above in C++ (with Boost). I want the same thing in Python.
The re.sub() function replaces occurrences of a pattern with another substring. It takes the following as input: the pattern to replace, the replacement, and the string to search.
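As a small illustration of re.sub(), here is a minimal sketch. The log line is a hypothetical sample and not taken from the question's file; only the standard re.sub(pattern, repl, string, count=0) signature is assumed.

```python
import re

# Hypothetical log line, used only for illustration.
line = "job42>Time Elapsed 00:01:07"

# Replace every run of digits with the letter N.
masked = re.sub(r"\d+", "N", line)

# The optional count argument limits how many replacements are made.
masked_once = re.sub(r"\d+", "N", line, count=1)

print(masked)
print(masked_once)
```

Note that re.sub() returns a new string, so on a very large input it is the searching functions (search, finditer) that combine most naturally with a memory map.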
Python has a module named re to work with RegEx. Here's an example:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions (Java needs this feature badly!).
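To make the raw-string point concrete, the sketch below compares an escaped pattern with its raw equivalent. The sample text being searched is an assumption made up for the example.

```python
import re

# Ordinary string literal: every regex backslash has to be doubled.
escaped = "\\w+>Time Elapsed"

# Raw string literal: backslashes are passed through unchanged.
raw = r"\w+>Time Elapsed"

# Both literals produce exactly the same pattern text.
same_pattern = (escaped == raw)

# Hypothetical input, used only to show the pattern matching.
match = re.search(raw, "build>Time Elapsed 3s")
found = match.group(0)

print(same_pattern, found)
```

The same applies to bytes patterns: a bytes literal can carry the raw prefix as well.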
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Everything works now (the Python 3.2.3 interface differs from Python 2.7 in a few places). For a working solution in Python 3.2.3, the search pattern just needs the b prefix so that it is a bytes pattern, because the memory map exposes bytes rather than text.
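import re
import mmap
import pprint

def ParseFile(fileName):
    # Open in binary mode: the map exposes bytes, so the pattern is bytes too.
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file without reading it into a string.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            # Iterate directly so the iterator is released before the map closes.
            for item in re.finditer(b"\\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")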
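To try the approach without a real log file, the following self-contained sketch writes a few lines to a temporary file and searches it through a memory map. The helper name find_elapsed and the sample contents are hypothetical, and the raw bytes prefix (rb) assumes a newer Python than 3.2.

```python
import mmap
import os
import re
import tempfile

def find_elapsed(path):
    """Return every 'Time Elapsed' line found via a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # group(0) copies the matched bytes, so the results
            # remain valid after the map has been closed.
            return [hit.group(0)
                    for hit in re.finditer(rb"\w+>Time Elapsed .*?\n", m)]

# Hypothetical sample data standing in for a large log file.
sample = b"boot\nstepA>Time Elapsed 12 ms\nnoise\nstepB>Time Elapsed 7 ms\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    results = find_elapsed(path)
finally:
    os.remove(path)

print(results)
```

Each result is a bytes object; call .decode() on it if text is needed, with an encoding that matches the file.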