
How to capture 0-2 groups in C++ regular expressions and print them?

Tags: c++, regex, c++11

Edit 3

I went back to the good ol' custom parsing approach, as I got stuck with the regular expression. It didn't turn out to be that bad: the file contents can be tokenized quite neatly, and the tokens can be parsed in a loop with a very simple state machine. For those who want to check, my other Stack Overflow question has a code snippet that does this with range-for, ifstream iterators and a custom stream tokenizer. These techniques considerably reduce the complexity of writing a custom parser.

I'd like to tokenize the contents of a file: the first part into capture groups of two, and then the rest just line by line. I have a semi-functional solution, but I'd like to learn how to make it better, that is, without the "extra processing" that makes up for my lack of knowledge of capture groups. Some preliminaries follow, and at the end a more exact question (the line

const std::regex expression("([^:]+?)(^:|$)");

...is the one I'd like to ask about, together with how to process its results).

The files are basically defined like this:

definition_literal : value_literal
definition_literal : value_literal
definition_literal : value_literal
definition_literal : value_literal
HOW TO INTERPRET THE FOLLOWING SECTION OF ROWS
[DATA ROW 1]
[DATA ROW 2]
...
[DATA ROW n]

Each of the data rows consists of a certain number of either integers or floating point numbers separated by whitespace, and every row has as many numbers as the others (e.g. each row could have four integers). The "interpretation section" basically describes this format in plain text on one row.

I have an almost working solution that reads such files like this:

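#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

int main()
{
    std::ifstream file("xyz", std::ios_base::in);
    if(file.good())
    {
        // Read the whole file into a string.
        std::stringstream file_memory_buffer;
        file_memory_buffer << file.rdbuf();
        std::string str = file_memory_buffer.str();
        file.close();

        // The expression this question is about.
        const std::regex expression("([^:]+?)(^:|$)");

        // Without a submatch argument, the token iterator yields the whole match.
        const std::sregex_token_iterator end;
        for(std::sregex_token_iterator i(str.begin(), str.end(), expression); i != end; ++i)
        {
            std::cout << (*i) << std::endl;
        }
    }

    return EXIT_SUCCESS;
}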

With the regex defined above, it prints the <value> parts of the definition lines, then the interpretation line, and then the data rows one by one. If I change the regex to

"([^:]+?)(:|$)"

...it prints all the lines tokenized in groups of one, almost as I would like, but how do I tokenize the first part in groups of two and the rest line by line?

Any pointers, code or explanations are truly welcome. Thanks.

EDIT:

As I already noted to Tom Kerr, with some additional points: this is also a rehearsal, or a coding kata if you will, where the goal is not to write a custom parser but to see if I could (or we could :-)) accomplish this with regex. I know regex isn't the most efficient tool here, but that doesn't matter.

What I'd hope to have is something like a list of tuples of header information (tuple of size 2), then the INTERPRET line (tuple of size 1), which I could use to choose a function on what to do with the data lines (tuple of size 1).

Yep, the "HOW TO INTERPRET" line comes from a set of well-defined strings, and I could just read line by line from the beginning, splitting strings along the way, until one of the INTERPRET lines is met. This regex solution is not the most efficient method, I know, but it is more of a coding kata to get myself to write something other than custom parsers (and it has been quite some time since I last wrote C++, so this is practice in that sense too).

EDIT 2

I have managed to get access to the tuples (in the context of this question) by changing the iterator type, like so

const std::sregex_iterator end;     
for(std::sregex_iterator i(str.begin(), str.end(), expression); i != end; ++i)
{
    std::cout << "0: " << (*i)[0] << std::endl;
    std::cout << "1: " << (*i)[1] << std::endl;
    std::cout << "2: " << (*i)[2] << std::endl;
    std::cout << "***" << std::endl;
}

This is still way off from what I'd like to have, though; there's something wrong with the regular expression I'm trying to use. In any event, this new find, another kind of iterator, helps too.
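One way to get tuples of size 2 for the header and tuples of size 1 for everything else out of `std::sregex_iterator` is an alternation with a separate capture group for "any other line". The following is only a sketch under assumptions that are not in the original post: the helper name `tokenize` and the alias `Fields` are hypothetical, header lines are assumed to use exactly `" : "` as the separator, neither side contains a colon, and lines end with `\n`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// One parsed line: two fields for "key : value" lines, one field otherwise.
using Fields = std::vector<std::string>;

// Hypothetical helper, not taken from the original post.
std::vector<Fields> tokenize(const std::string& text)
{
    // Alternative 1: "key : value", captured in groups 1 and 2.
    // Alternative 2: any other non-empty line, captured in group 3.
    static const std::regex line_re("([^:\n]+) : ([^:\n]+)|([^\n]+)");

    std::vector<Fields> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), line_re); it != end; ++it)
    {
        const std::smatch& m = *it;
        if (m[1].matched)
        {
            // Header line: a tuple of size 2.
            result.push_back(Fields{m[1].str(), m[2].str()});
        }
        else
        {
            // Only one alternative can match, so group 3 must be set here.
            assert(m[3].matched);
            result.push_back(Fields{m[3].str()});
        }
    }
    return result;
}
```

Checking `matched` on group 1 tells the two kinds of lines apart, so no post-processing of the matched text is needed. The captured fields are not trimmed beyond the single spaces around the colon.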

Veksi asked Jun 29 '12 22:06



1 Answer

I believe the regular expression you are attempting is this:

TEST(re) {
    static const boost::regex re("^([^:]+) : ([^:]+)$");

    std::string str = "a : b";
    CHECK(boost::regex_match(str, re));
    CHECK(!boost::regex_match("a:a : bbb", re));
    CHECK(!boost::regex_match("aaa : b:b", re));

    boost::smatch what;
    CHECK(boost::regex_match(str, what, re, boost::match_extra));
    CHECK_EQUAL(3, what.size());
    CHECK_EQUAL(str, what[0]);
    CHECK_EQUAL("a", what[1]);
    CHECK_EQUAL("b", what[2]);
}
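Since the question is tagged c++11, the same check can probably be written with `std::regex` as well. This is a minimal sketch, not part of the original answer; the function name `match_definition` and its out-parameter are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: splits "key : value" into its two parts.
// Returns false when the line does not have exactly that shape.
bool match_definition(const std::string& line,
                      std::pair<std::string, std::string>& out)
{
    // regex_match has to consume the whole string, so ^ and $ are not needed.
    static const std::regex re("([^:]+) : ([^:]+)");

    std::smatch what;
    if (!std::regex_match(line, what, re))
        return false;

    // what[0] is the whole match; what[1] and what[2] are the capture groups.
    assert(what.size() == 3);
    out = std::make_pair(what[1].str(), what[2].str());
    return true;
}
```

As in the Boost test, `"a : b"` is accepted, while `"a:a : bbb"` and `"aaa : b:b"` are rejected because neither side may contain a colon.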

I'm not sure I would recommend regex in this instance though. I think you'll find simply reading a line at a time, splitting on :, and then trimming the spaces more manageable.

I guess if you can't depend on the line below as a sentinel, then it would be more difficult. Usually I would expect a format like this to be obvious from that line, not from the format of each line of the header.

HOW TO INTERPRET THE FOLLOWING SECTION OF ROWS
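The line-at-a-time approach suggested above could look roughly like the sketch below. It rests on assumptions that are not in the original: the names `ParsedFile`, `trim` and `parse` are hypothetical, the first non-empty line without a `:` is taken to be the sentinel (a real parser would compare it against the known set of interpretation strings), and all data values are read as `double`.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct ParsedFile
{
    std::vector<std::pair<std::string, std::string>> header; // key/value pairs
    std::string interpretation;                              // the sentinel line
    std::vector<std::vector<double>> rows;                   // numeric data rows
};

// Removes leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s)
{
    const char* ws = " \t\r";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    assert(last >= first);
    return s.substr(first, last - first + 1);
}

// Assumption: the first non-empty line without a ':' is the sentinel.
ParsedFile parse(std::istream& in)
{
    ParsedFile result;
    bool in_header = true;
    std::string line;
    while (std::getline(in, line))
    {
        line = trim(line);
        if (line.empty())
            continue;
        if (in_header)
        {
            const std::string::size_type colon = line.find(':');
            if (colon != std::string::npos)
            {
                // Header line: split on the first ':' and trim both sides.
                result.header.emplace_back(trim(line.substr(0, colon)),
                                           trim(line.substr(colon + 1)));
            }
            else
            {
                // Sentinel reached: everything after it is a data row.
                result.interpretation = line;
                in_header = false;
            }
            continue;
        }
        // Data row: whitespace-separated numbers.
        std::istringstream fields(line);
        std::vector<double> row;
        for (double value; fields >> value;)
            row.push_back(value);
        result.rows.push_back(row);
    }
    return result;
}
```

Passing a `std::ifstream` or a `std::istringstream` to `parse` works the same way, since both derive from `std::istream`.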
Tom Kerr answered Nov 14 '22 23:11
