Increase C++ regex replace performance

Tags:

I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags out of them. I've succeeded doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over its contents takes around 6 seconds on my PC. I can bring this down to 2 by adding some compiler optimization flags. Using Python, however, I can get it done it less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?

My C++ code:

std::regex xml_tags_regex("<[^>]*>");

for (std::vector<std::string>::iterator it = _files.begin(); it != 
        _files.end(); it++) {

    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();

    std::string buffer(size, ' ');

    file.seekg(0);
    file.read(&buffer[0], size);

    buffer = regex_replace(buffer, xml_tags_regex, "");

    file.close();
}

My Python code:

regex = re.compile('<[^>]*>')

for filename in filenames:
    with open(filename) as f:
        content = f.read()
        content = regex.sub('', content)

P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.

737

asked Sep 10 '14 13:09

user1219263

1 Answers

C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed, however, PCRECPP provides very limited means for regular expression based substitution, citing the man page:

You can replace the first match of "pattern" in "str" with "rewrite". Within "rewrite", backslash-escaped digits (\1 to \9) can be used to insert text matching corresponding parenthesized group from the pattern. \0 in "rewrite" refers to the entire matching text.

This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:

Command string syntax

Command syntax follows Perl s/pattern/substitute/[options] convention. Any character (except the backslash \) can be used as a delimiter, not just /, but make sure that delimiter is escaped with a backslash (\) if used in pattern, substitute or options substrings, e.g.:

s/\\/\//g to replace all backslashes with forward ones

Remember to double backslashes in C++ code, unless using raw string literal (see string literal):

pcrscpp::replace rx("s/\\\\/\\//g");

Pattern string syntax

Pattern string is passed directly to pcre*_compile, and thus has to follow PCRE syntax as described in PCRE documentation.

Substitute string syntax

Substitute string backreferencing syntax is similar to Perl's:

$1 ... $n: nth capturing subpattern matched.

$& and $0: the whole match

${label} : labled subpattern matched. label is up to 32 alphanumerical + underscore characters ('A'-'Z','a'-'z','0'-'9','_'), first character must be alphabetical

$` and $' (backtick and tick) refer to the areas of the subject before and after the match, respectively. As in Perl, the unmodified subject is used, even if a global substitution previously matched.

Also, following escape sequences get recognized:

\n: newline

\r: carriage return

\t: horizontal tab

\f: form feed

\b: backspace

\a: alarm, bell

\e: escape

\0: binary zero

Any other escape sequence \<char>, is interpreted as <char>, meaning that you have to escape backslashes too

Options string syntax

In Perl-like manner, options string is a sequence of allowed modifier letters. PCRSCPP recognizes following modifiers:

Perl-compatible flags

g: global replace, not just the first match

i: case insensitive match
(PCRE_CASELESS)

m: multi-line mode: ^ and $ additionally match positions after and before newlines, respectively
(PCRE_MULTILINE)

s: let the scope of the . metacharacter include newlines (treat newlines as ordinary characters)
(PCRE_DOTALL)

x: allow extended regular expression syntax, enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)

PHP-compatible flags

A: "anchor" pattern: look only for "anchored" matches: ones that start with zero offset. In single-line mode is identical to prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)

D: treat dollar $ as subject end assertion only, overriding the default: end, or immediately before a newline at the end. Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)

U: invert * and + greediness logic: make ungreedy by default, ? switches back to greedy. (?U) and (?-U) in-pattern switches remain unaffected
(PCRE_UNGREEDY)

u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string. Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY | PCRE_BSR_UNICODE | PCRE_UCP)

PCRSCPP own flags:

N: skip empty matches
(PCRE_NOTEMPTY)

T: treat substitute as a trivial string, i.e., make no backreference and escape sequences interpretation

n: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines, the replacement result is plain concatenation of matches, be specifically aware of this in multiline mode

I wrote a simple speed test code, which stores a 10x copy of file "move.sh" and tests regex performance on resulting string:

#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>

#include <chrono>

int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx          (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");

    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();
    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);

    int n = 10;

    std::cout << "Running tests " << n << " times: be patient..." << std::endl;

    std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::string result1, result2;

    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();

        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);

        // clear result
        std::string().swap(result2);

        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }
    std::cout << "Time taken by std::regex_replace: "
              << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;

    std::cout << "Time taken by pcrscpp::replace: "
              << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;

    return 0;
}

(note that std and pcrscpp regular expressions do the same here, the trailing newline in expression for pcrscpp is due to std::regex_replace not stripping newlines despite std::regex_constants::format_no_copy)

and launched it on a large (20.9 MB) shell move script:

Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330

As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow with pattern complexity increase, since PCRE deals with complicated patterns much better. I originally wrote a wrapper for myself, but I think it can be useful for others too.

Regards, Alex

146

answered Oct 04 '22 03:10

Alex Potapenko

Related questions
                            
                                Python PyAudio installation problems (with PortAudio)
                            
                                python: pickling c objects
                            
                                Can't install numpy on Mountain Lion
                            
                                Is there an official/unofficial way to program Unity scripts in python? [closed]
                            
                                How to turn off transparency in Matplotlib's 3D Scatter plot?
                            
                                Fibonacci Rabbits Dying After Arbitrary # of Months
                            
                                Converting a python script to a web application [closed]
                            
                                What is uWSGI master mode?
                            
                                How to show source code of a package function in IPython notebook
                            
                                How to create a function at runtime with specified argument names?
                            
                                Fast detection or simulation of WSAECONNREFUSED
                            
                                Pycharm set the correct environment variable PATH
                            
                                Python easy_install in a virtualenv gives setuptools error
                            
                                Why do I get a "404 Not Found" error even though the link is on the server?
                            
                                How to install python-devel when using virtualenv
                            
                                Storing pandas DataFrames in SQLAlchemy models
                            
                                Getting "If-Match or If-None-Match header or entry etag attribute required" errors when batch deleting contacts
                            
                                Is there any analogue of EXIT_SUCCESS and EXIT_FAILURE macros in Python 2.7.6
                            
                                missing column after pandas groupby
                            
                                Python bcolz how to merge two ctables

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Increase C++ regex replace performance

Tags:

c++

performance

python

regex

replace

user1219263

People also ask

1 Answers

Command string syntax

Pattern string syntax

Substitute string syntax

Options string syntax

Alex Potapenko

Recent Activity

Donate For Us