
How do I convert a boost::spirit::lex token's value from iterator_range to a string?

When I try to convert the value of a token from iterator_range, the lexer fails when attempting to read the next token.

Here's the Tokens struct which holds the token definitions: (I don't think this is relevant but I'm including just in case.)

template <typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
    Tokens();

    boost::spirit::lex::token_def<std::string> identifier;
    boost::spirit::lex::token_def<std::string> string;
    boost::spirit::lex::token_def<bool> boolean;
    boost::spirit::lex::token_def<double> real;
    boost::spirit::lex::token_def<> comment;
    boost::spirit::lex::token_def<> whitespace;
};

template <typename Lexer>
Tokens<Lexer>::Tokens()
{
    // Define regex macros
    this->self.add_pattern
        ("LETTER", "[a-zA-Z_]")
        ("DIGIT", "[0-9]")
        ("INTEGER", "-?{DIGIT}+")
        ("FLOAT", "-?{DIGIT}*\\.{DIGIT}+");

    // Define the tokens' regular expressions
    identifier = "{LETTER}({LETTER}|{DIGIT})*";
    string = "\"[a-zA-Z_0-9]*\"";
    boolean = "true|false";
    real = "{INTEGER}|{FLOAT}";
    comment = "#[^\n\r\f\v]*$";
    whitespace = "\x20\n\r\f\v\t+";

    // Define tokens
    this->self
        = identifier
        | string
        | boolean
        | real
        | '{'
        | '}'
        | '<'
        | '>';

    // Define tokens to be ignored
    this->self("WS")
        = whitespace
        | comment;
}

Here's the definition of my token and lexer types:

typedef lex::lexertl::token<char const*> TokenType;
typedef lex::lexertl::actor_lexer<TokenType> LexerType;

Here's the code I'm using for reading a token and converting its value to a string:

Tokens<LexerType> tokens;

std::string string = "9index";
char const* first = string.c_str();
char const* last = &first[string.size()];
LexerType::iterator_type token = tokens.begin(first, last);
LexerType::iterator_type end = tokens.end();

//typedef boost::iterator_range<char const*> iterator_range;
//const iterator_range& range = boost::get<iterator_range>(token->value());
//std::cout << std::string(range.begin(), range.end()) << std::endl;

++token;

token_is_valid(*token); // Returns false ONLY if I uncomment the above code

The output of this code is "9" (it read the first number, leaving "index" in the stream). If I print out the value of string(first, last) at this point, it shows "ndex". For some reason the lexer is failing on that 'i' character?

I've even tried using a std::stringstream to do the conversion, but this also causes the next token to be invalid:

std::stringstream out;
out << token->value();
std::cout << out.str() << std::endl;

++token;

token_is_valid(*token); // still fails

Finally, the next token IS valid if I simply send the token's value to cout:

std::cout << token->value() << std::endl;

++token;

token_is_valid(*token); // success, what?

What am I missing about how the iterator_range returned by token->value() works? Neither of the methods I used for converting it to a string appears to modify the iterator_range or the lexer's input stream of characters.

edit: I'm adding this here since a comment reply is too short to fully explain what happened.

I figured it out. As sehe and drhirsch pointed out, the code in my original question was a sanitized version of what I'm actually doing. I'm testing the lexer using gtest unit tests with a test fixture class. As a member of that class, I have void scan(const std::string& str), which assigns the first and last iterators (data members of the fixture) from the given string. The problem is that as soon as we exit this function, the string that the const std::string& str parameter refers to is destroyed and no longer exists, which invalidates these iterators even though they're data members of the fixture.

Moral of the story: The object to which the iterators passed to lexer::begin() refer should exist as long as you expect to be reading tokens.

I'd rather delete this question than document my silly mistake on the Internet, but to help the community I suppose I should leave it.

May Oakes asked May 02 '12 03:05


1 Answer

Judging from the given code, you appear to be looking at a compiler/library bug. I can't reproduce the problem with any of the following combinations:

Edit: Now includes clang++ and boost 1_49_0. Valgrind comes up clean for a selection of the tested cases.

  • clang++ 2.9, -O3, boost 1_46_1
  • clang++ 2.9, -O0, boost 1_46_1
  • clang++ 2.9, -O3, boost 1_48_0
  • clang++ 2.9, -O0, boost 1_48_0
  • clang++ 2.9, -O3, boost 1_49_0
  • clang++ 2.9, -O0, boost 1_49_0

  • gcc 4.4.5, -O0, boost 1_42_1
  • gcc 4.4.5, -O0, boost 1_46_1
  • gcc 4.4.5, -O0, boost 1_48_0
  • gcc 4.4.5, -O0, boost 1_49_0
  • gcc 4.4.5, -O3, boost 1_42_1
  • gcc 4.4.5, -O3, boost 1_46_1
  • gcc 4.4.5, -O3, boost 1_48_0
  • gcc 4.4.5, -O3, boost 1_49_0
  • gcc 4.6.1, -O0, boost 1_46_1
  • gcc 4.6.1, -O0, boost 1_48_0
  • gcc 4.6.1, -O0, boost 1_49_0
  • gcc 4.6.1, -O3, boost 1_42_1
  • gcc 4.6.1, -O3, boost 1_46_1
  • gcc 4.6.1, -O3, boost 1_48_0
  • gcc 4.6.1, -O3, boost 1_49_0

Full code tested:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>

#include <iostream>
#include <string>

namespace qi    = boost::spirit::qi;
namespace lex   = boost::spirit::lex;

template <typename Lexer>
struct Tokens : lex::lexer<Lexer>
{
    Tokens();

    lex::token_def<std::string> identifier;
    lex::token_def<std::string> string;
    lex::token_def<bool> boolean;
    lex::token_def<double> real;
    lex::token_def<> comment;
    lex::token_def<> whitespace;
};

template <typename Lexer>
Tokens<Lexer>::Tokens()
{
    // Define regex macros
    this->self.add_pattern
        ("LETTER", "[a-zA-Z_]")
        ("DIGIT", "[0-9]")
        ("INTEGER", "-?{DIGIT}+")
        ("FLOAT", "-?{DIGIT}*\\.{DIGIT}+");

    // Define the tokens' regular expressions
    identifier = "{LETTER}({LETTER}|{DIGIT})*";
    string = "\"[a-zA-Z_0-9]*\"";
    boolean = "true|false";
    real = "{INTEGER}|{FLOAT}";
    comment = "#[^\n\r\f\v]*$";
    whitespace = "\x20\n\r\f\v\t+";

    // Define tokens
    this->self
        = identifier
        | string
        | boolean
        | real
        | '{'
        | '}'
        | '<'
        | '>';

    // Define tokens to be ignored
    this->self("WS")
        = whitespace
        | comment;
}

////////////////////////////////////////////////
typedef lex::lexertl::token<char const*> TokenType;
typedef lex::lexertl::actor_lexer<TokenType> LexerType;

int main(int argc, const char *argv[])
{
    Tokens<LexerType> tokens;

    std::string string = "9index";
    char const* first = string.c_str();
    char const* last = &first[string.size()];
    LexerType::iterator_type token = tokens.begin(first, last);
    LexerType::iterator_type end = tokens.end();

    typedef boost::iterator_range<char const*> iterator_range;
    const iterator_range& range = boost::get<iterator_range>(token->value());
    std::cout << std::string(range.begin(), range.end()) << std::endl;

    ++token;

    // The question reports false here only when the conversion above is enabled
    std::cout << "Next valid: " << std::boolalpha << token_is_valid(*token) << '\n';

    return 0;
}
sehe answered Oct 25 '22 10:10