Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simple string parsing with C++

Tags:

c++

People also ask

How do you parse a string in C?

Splitting a string using strtok() in C In C, the strtok() function is used to split a string into a series of tokens based on a particular delimiter. A token is a substring extracted from the original string.

How do you parse a string?

String parsing in java can be done by using a wrapper class. Using the Split method, a String can be converted to an array by passing the delimiter to the split method. The split method is one of the methods of the wrapper class. String parsing can also be done through StringTokenizer.

Can you parse in C?

Fortunately, the C programming language has a standard C library function to do just that. The strtok function breaks up a line of data according to "delimiters" that divide each field. It provides a streamlined way to parse data from an input string.

What is parsing function in C?

To parse, in computer science, is where a string of commands – usually a program – is separated into more easily processed components, which are analyzed for correct syntax and then attached to tags that define each component. The computer can then process each program chunk and transform it into machine language.


This is a try using only standard C++.

Most of the time I use a combination of std::istringstream and std::getline (which can work to separate words) to get what I want. And if I can I make my config files look like:

foo=1,2,3,4

which makes it easy.

text file is like this:

foo=1,2,3,4
bar=0

And you parse it like this:

int main()
{
    std::ifstream file( "sample.txt" );

    std::string line;
    while( std::getline( file, line ) )   
    {
        std::istringstream iss( line );

        std::string result;
        if( std::getline( iss, result , '=') )
        {
            if( result == "foo" )
            {
                std::string token;
                while( std::getline( iss, token, ',' ) )
                {
                    std::cout << token << std::endl;
                }
            }
            if( result == "bar" )
            {
               //...
    }
}

The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::string file_name = "simple.txt";
   strtk::for_each_line(file_name,
                       [](const std::string& line)
                       {
                          std::deque<std::string> token_list;
                          strtk::parse(line,"[]: ",token_list);
                          if (token_list.empty()) return;

                          const std::string& key = token_list[0];

                          if (key == "foo")
                          {
                            //do 'foo' related thing with token_list[1] 
                            //and token_list[2]
                            return;
                          }

                          if (key == "bar")
                          {
                            //do 'bar' related thing with token_list[1]
                            return;
                          }

                       });

   return 0;
}

More examples can be found Here


Boost.Spirit is not reserved to parse complicated structure. It is quite good at micro-parsing too, and almost match the compactness of the C + scanf snippet :

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <sstream>

using namespace boost::spirit::qi;


int main()
{
   std::string text = "foo: [3 4 5]\nbaz: 3.0";
   std::istringstream iss(text);

   std::string line;
   while (std::getline(iss, line))
   {
      int x, y, z;
      if(phrase_parse(line.begin(), line.end(), "foo: [">> int_ >> int_ >> int_ >> "]", space, x, y, z))
         continue;
      float w;
      if(phrase_parse(line.begin(), line.end(), "baz: ">> float_, space , w))
         continue;
   }
}

(Why they didn't add a "container" version is beyond me, it would be much more convenient if we could just write :

if(phrase_parse(line, "foo: [">> int_ >> int_ >> int_ >> "]", space, x, y, z))
   continue;

But it's true that :

  • It adds a lot of compile time overhead.
  • Error messages are brutal. If you make a small mistake with scanf, you just run your program and immediately get a segfault or an absurd parsed value. Make a small mistake with spirit and you will get hopeless gigantic error messages from the compiler and it takes a LOT of practice with boost.spirit to understand them.

So ultimately, for simple parsing I use scanf like everyone else...


Regular expressions can often be used for parsing strings. Use capture groups (parentheses) to get the various parts of the line being parsed.

For instance, to parse an expression like foo: [3 4 56], use the regular expression (.*): \[(\d+) (\d+) (\d+)\]. The first capture group will contain "foo", the second, third and fourth will contain the numbers 3, 4 and 56.

If there are several possible string formats that need to be parsed, like in the example given by the OP, either apply separate regular expressions one by one and see which one matches, or write a regular expression that matches all the possible variations, typically using the | (set union) operator.

Regular expressions are very flexible, so the expression can be extended to allow more variations, for instance, an arbitrary number of spaces and other whitespace after the : in the example. Or to only allow the numbers to contain a certain number of digits.

As an added bonus, regular expressions provide an implicit validation since they require a perfect match. For instance, if the number 56 in the example above was replaced with 56x, the match would fail. This can also simplify code as, in the example above, the groups containing the numbers can be safely cast to integers without any additional checking being required after a successful match.

Regular expressions usually run at good performance and there are many good libraries to chose from. For instance, Boost.Regex.