
Regular Expression For Parsing Data

Tags:

c++

regex

I am writing an application that reads some data from a simple text file. The data files, that I am interested in, have lines in the following form:

Mem(100) = 120
Mem(200) = 231
Mem(43) = 12
...
Mem(1293) = 12.54

So, as you can understand, the pattern of each line is something like

(\s)*(\t)*Mem([0-9]*) (\s,\t)*= (\s,\t)*[0-9]*(.)*[0-9]*

That is, any amount of whitespace may come before the character sequence "Mem", which is followed by a left parenthesis, a number, and a right parenthesis. Afterwards, there is any amount of whitespace until an '=' (equals) character is encountered, and then any amount of whitespace until I come across a (possibly) floating point number.

How can I express that in a C++ regex pattern? I am really new to the regular expression concept in C++ so I would need some help.

Thank you

nick.katsip asked Oct 11 '13 21:10


1 Answer

First of all, remember to #include <regex>.

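By default, std::regex uses the ECMAScript grammar, so patterns look much like the regular expressions in other languages. Keep in mind that std::regex_match succeeds only if the whole string matches the pattern; to find a match inside a longer string, use std::regex_search instead.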

Let's start with a simple example:

std::string str = "Mem(100)=120";
std::regex regex("^Mem\\([0-9]+\\)=[0-9]+$");
std::cout << std::regex_match(str, regex) << std::endl;

In this case, our regex is ^Mem\([0-9]+\)=[0-9]+$. Let's take a look at what it does:

  • The ^ at the beginning tells C++ this is where the line starts, so AMem(1)=2 should not match.
  • The $ at the end tells C++ this is where the line ends, so Mem(1)=2x should not match.
  • \\( is a literal ( character. ( has a very special meaning in regular expressions, so we escape it \(. However, the \ character has a special meaning in C++ strings, so we use \\( to tell C++ to pass the \( to the regular expression engine.
  • [0-9] matches a digit. \\d also works with the default ECMAScript grammar.
  • [0-9]+ means at least one digit. If Mem() is acceptable, then use [0-9]* instead.
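The points above can be checked with a small sketch. The helper name matches_simple is made up for illustration and is not part of the original answer; the pattern is the one discussed in the list.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when the entire string has the form
// Mem(<digits>)=<digits>, with no whitespace anywhere.
bool matches_simple(const std::string& line) {
    // static: the regex is compiled once and reused on later calls.
    static const std::regex pattern("^Mem\\([0-9]+\\)=[0-9]+$");
    return std::regex_match(line, pattern);
}
```

With this helper, "Mem(100)=120" is accepted, while "AMem(1)=2", "Mem(1)=2x" and "Mem()=2" are rejected.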

As you can see, this is just like the regular expressions you'd find in other languages (such as Java or C#).

Now, to consider whitespace, use std::regex regex("^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");

Note that \s includes \t, so no need to specify both. If it didn't, you'd use (\s|\t) or [\s\t], not (\s,\t).
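As a rough sketch of the whitespace-tolerant version, the function below wraps that pattern. The name matches_with_spaces is an assumption made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: like the simple pattern, but allows optional
// whitespace (spaces, tabs, ...) at the start, around '=', and at the end.
bool matches_with_spaces(const std::string& line) {
    static const std::regex pattern(
        "^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");
    return std::regex_match(line, pattern);
}
```

Note that this pattern does not allow whitespace between Mem and the opening parenthesis, so "Mem (100) = 120" is still rejected.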

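Finally, to include floating-point numbers, we first need to decide whether Mem(1) = 1. (that is, a dot without a digit after it) is acceptable.

If it isn't, then the dot and the digits after it (the .23 in 1.23) must appear together or not at all. In regexes, we put them in a group and use ? to mark that group as optional.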

std::regex regex("^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$");

Note that we use \. instead of a bare dot. An unescaped . has a special meaning in regular expressions (it matches any character), so we need to escape it.
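A minimal sketch of the floating-point version, assuming a trailing dot such as "1." should be rejected. The helper name matches_number is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: accepts an integer or a number with a fractional
// part after '='. The group (\.[0-9]+)? makes ".digits" optional as a unit.
bool matches_number(const std::string& line) {
    static const std::regex pattern(
        "^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$");
    return std::regex_match(line, pattern);
}
```

Because the dot is escaped, "Mem(1) = 12x54" does not match, which it would if the dot were left as a wildcard.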

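If you have a compiler that supports raw string literals (e.g. Visual Studio 2013, GCC 4.5, Clang 3.0), you can simplify the regex string. Note that GCC's standard library only has a working std::regex from version 4.9 onward, even though raw strings themselves work earlier: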

std::regex regex(R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)");
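To illustrate that the raw string only changes how the pattern is written in source code, here is a small sketch comparing the two spellings. The variable and function names are made up for this example.

```cpp
#include <string>

// The same pattern written twice: once with doubled backslashes,
// once as a raw string literal where backslashes are taken literally.
const std::string escaped_pattern =
    "^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$";
const std::string raw_pattern =
    R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)";

// Hypothetical check: both literals produce exactly the same characters.
bool patterns_are_identical() {
    return escaped_pattern == raw_pattern;
}
```

A raw string ends at the first )" it contains, so a pattern that itself includes that sequence would need a custom delimiter, for example R"re(...)re".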

To extract information about the matched string, you can use std::smatch and groups.

Let's start with a small change:

std::string str = " Mem(100)=120";
std::regex regex("^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
std::smatch m;

std::cout << std::regex_match(str, m, regex) << std::endl;

Note three things:

  1. We added smatch. This class stores extra result info about the match.
  2. We added parentheses around the [0-9]+ inside Mem(). This defines a group. Groups tell the regex engine to keep track of whatever is matched within them.
  3. We added more parentheses around the whole floating-point number. This defines a second group. The parentheses that were already around the optional fractional part form a third group, nested inside the second.

Very importantly, the parentheses that define groups are NOT escaped, since we don't want them to match actual parenthesis characters. We want their special regex meaning.

Now that we have the groups, we can use them:

for (auto result : m) {
    std::cout << result << std::endl;
}

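This will first print the whole string, then the number in Mem(), then the final number, and finally an empty line for the optional fractional part, which did not take part in this match.

In other words, m[0] gives us the whole match, m[1] gives us the first group, m[2] gives us the second group, and m[3] gives us the third group (the fractional part, such as .54, or an empty string when it is absent).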
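Putting the groups to work, the sketch below converts the captured text into numbers and applies it to every line of an input stream, which is one possible way to handle the data file from the question. MemEntry, parse_mem_line and parse_mem_stream are hypothetical names introduced here, and silently skipping lines that do not match is an assumption, not something the original specifies.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: one parsed "Mem(address) = value" line.
struct MemEntry {
    int address;        // taken from group 1
    double value;       // taken from group 2
    bool has_fraction;  // true when group 3 (the ".digits" part) matched
};

// Fills `out` and returns true if the line matches; returns false otherwise.
// std::stoi throws std::out_of_range if the address does not fit in an int.
bool parse_mem_line(const std::string& line, MemEntry& out) {
    static const std::regex pattern(
        "^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    out.address = std::stoi(m[1].str());
    out.value = std::stod(m[2].str());
    out.has_fraction = m[3].matched;
    return true;
}

// Reads the stream line by line and keeps only the lines that match
// (assumption: anything else, such as a "..." line, is skipped).
std::vector<MemEntry> parse_mem_stream(std::istream& in) {
    std::vector<MemEntry> entries;
    std::string line;
    MemEntry entry;
    while (std::getline(in, line)) {
        if (parse_mem_line(line, entry)) {
            entries.push_back(entry);
        }
    }
    return entries;
}
```

Any std::istream can be passed in, for instance a std::ifstream opened on the data file or a std::istringstream in a test.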

luiscubal answered Oct 06 '22 10:10