
Python big file parsing with re

Tags:

python

regex

file

How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't seem to help, because their content can't be converted to some kind of lazy string, and the re module only accepts a string as its content argument.

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

To demonstrate my requirements, I wrote the short sample above in C++ (using Boost). I want the same behavior in Python.

Alex asked Jul 26 '12 17:07



1 Answer

Everything works now (the Python 3.2.3 interface differs somewhat from Python 2.7). In Python 3.2.3, the search pattern just needs the b prefix to get a working solution: the mmap object exposes the file as bytes, so the pattern must be a bytes pattern as well.

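import re
import mmap
import pprint

def ParseFile(fileName):
    # Open in binary mode: the mapping exposes the file contents as bytes.
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file; the OS pages it in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            # Bytes pattern, because the mmap object is bytes-like.
            for item in re.finditer(b"\\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")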
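
For anyone who wants the matches back as values instead of printed, the same approach can be wrapped in a generator. This is only a minimal sketch: the names iter_matches and demo, the file name sample.log, and the sample log contents are made up for illustration and are not part of the original solution. It assumes Python 3.3 or later (for the rb"" literal) and that the log is ASCII.

```python
import mmap
import os
import re
import tempfile


def iter_matches(path, pattern):
    """Yield every match of a bytes pattern in the file at `path`.

    Hypothetical helper: the file is memory-mapped, so it is never
    read into one big string; each yielded value is a small bytes copy.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for match in re.finditer(pattern, m):
                # group(0) copies only the matched bytes out of the mapping.
                yield match.group(0)


def demo():
    # Build a small throwaway log file with made-up contents.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.log")
        with open(path, "wb") as f:
            f.write(b"job1>Time Elapsed 00:01\nnoise\njob2>Time Elapsed 00:02\n")
        # Decode each match to str only after it has been extracted.
        return [raw.decode("ascii").rstrip("\n")
                for raw in iter_matches(path, rb"\w+>Time Elapsed .*?\n")]


if __name__ == "__main__":
    print(demo())
```

With the made-up sample above, demo() returns the two "Time Elapsed" lines and skips the "noise" line. Decoding happens per match, so only the matched fragments are ever turned into str objects.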
Alex answered Oct 26 '22 21:10