I'm building a text parser that uses std::string as the core storage for strings.
I know this is not optimal and that parsers inside compilers use optimized approaches for this. In my project I don't mind losing some performance in exchange for more clarity and easier maintenance.
At the beginning I read a huge text into memory and then I scan each character to build an ordered set of tokens; it's a simple lexer. Currently I'm using std::string to represent the text of a token, but I would like to improve this a bit by using a reference/pointer into the original text.
From what I have read, it is bad practice to return and hold on to iterators, and it is also bad practice to refer to the internal buffer of a std::string.
Any suggestions on how to accomplish this in a "clean" way?
To get the value pointed to by a pointer, you need to use the dereferencing operator * (e.g., if pNumber is an int pointer, *pNumber returns the value pointed to by pNumber). This is called dereferencing or indirection.
The c_str method of std::string returns a raw pointer to the memory buffer owned by the std::string. The pointer is only safe to use while the std::string is still alive and has not been modified through a non-const operation, since such operations may reallocate the buffer. When the std::string goes out of scope, its destructor is called and the memory is deallocated, so it is no longer safe to use the pointer.
std::string::string constructors: the default constructor builds an empty string, with a length of zero characters; the copy constructor builds a copy of str; the substring constructor copies the portion of str that begins at the character position pos and spans len characters (or runs until the end of str, if either str is too short or len is string::npos).
Use the std::string func() notation to return a string from a function in C++. Returning by value is the preferred way to return string objects from functions. Since the std::string class has a move constructor, returning even long strings by value is efficient.
There are proposals to add string_view to C++ in an upcoming standard.
A string_view is a non-owning iterable range over characters with many of the utilities and properties you'd expect of a string class, except you cannot insert/delete characters (and editing characters is often blocked in some subtypes).
I would advise trying that approach -- write your own (in your own utility namespace). (You should have your own utility namespace for reusable code snippets anyhow).
The core data is a pair of char* or std::string::iterators (or const versions). If the user needs a null-terminated buffer, a to_string method allocates one. I would start with non-mutable (const) character data. Do not forget begin and end: they make your view iterable with for(:) loops.
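A minimal sketch of such a hand-written view, assuming a pair of const char* as the core data. The names util and text_view are hypothetical, chosen only for illustration; a real utility would probably want comparison and search helpers as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace util {  // hypothetical utility namespace

// Non-owning, read-only view over a contiguous run of characters.
// The viewed buffer must outlive every view that points into it.
class text_view {
public:
    text_view() : first_(nullptr), last_(nullptr) {}

    text_view(const char* first, const char* last)
        : first_(first), last_(last) {
        assert(first <= last);
    }

    // View over a whole std::string (s must outlive the view).
    explicit text_view(const std::string& s)
        : first_(s.data()), last_(s.data() + s.size()) {}

    // begin/end make the view usable in range-based for loops.
    const char* begin() const { return first_; }
    const char* end() const { return last_; }

    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    bool empty() const { return first_ == last_; }

    // Sub-view of at most len characters starting at pos, clamped to the view.
    text_view substr(std::size_t pos, std::size_t len) const {
        if (pos > size()) pos = size();
        if (len > size() - pos) len = size() - pos;
        return text_view(first_ + pos, first_ + pos + len);
    }

    // Allocates an owning, null-terminated copy for callers that need one.
    std::string to_string() const {
        return empty() ? std::string() : std::string(first_, last_);
    }

private:
    const char* first_;
    const char* last_;
};

}  // namespace util
```

Given a std::string named text, util::text_view(text).substr(4, 1) yields a token view without copying any characters; only to_string allocates.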
This design has the danger that the original std::string has to persist long enough to outlast all of the views.
If you are willing to give up some performance for safety, have the view own a std::shared_ptr<const std::string> that it can move a std::string into. As a first step, move the entire buffer into it, and then start chopping/parsing it down (child views make a new shared pointer to the same data). Then your view class is more like a non-mutable string with shared storage.
The upsides to the shared_ptr<const> version include safety, longer lifetime of the views (there is no more lifetime dependency), and you can easily forward your const "substring" type methods to the std::string so you can write less code.
Downsides include possible incompatibility with the incoming standard one [1], and lower performance because you are dragging a shared_ptr around.
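A rough sketch of the shared-storage variant, under the assumption that each view stores a std::shared_ptr<const std::string> plus an offset and a length. The names util and shared_text are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace util {  // hypothetical utility namespace

// Immutable string slice that shares ownership of its backing buffer,
// so a slice stays valid even after the original string object is gone.
class shared_text {
public:
    // Takes ownership of the whole buffer (moved in when given an rvalue).
    explicit shared_text(std::string text)
        : data_(std::make_shared<const std::string>(std::move(text))),
          pos_(0),
          len_(data_->size()) {}

    // Child view: shares the same buffer and copies no characters.
    shared_text substr(std::size_t pos, std::size_t len) const {
        if (pos > len_) pos = len_;
        if (len > len_ - pos) len = len_ - pos;
        return shared_text(data_, pos_ + pos, len);
    }

    std::size_t size() const { return len_; }
    const char* begin() const { return data_->data() + pos_; }
    const char* end() const { return begin() + len_; }

    // Forwarded to std::string::substr, which allocates an owning copy.
    std::string to_string() const { return data_->substr(pos_, len_); }

private:
    shared_text(std::shared_ptr<const std::string> data,
                std::size_t pos, std::size_t len)
        : data_(std::move(data)), pos_(pos), len_(len) {
        assert(pos_ + len_ <= data_->size());
    }

    std::shared_ptr<const std::string> data_;  // shared, read-only storage
    std::size_t pos_;                          // offset of this slice
    std::size_t len_;                          // length of this slice
};

}  // namespace util
```

Every copy of a shared_text bumps an atomic reference count, which is the performance cost mentioned above; in exchange, tokens can safely outlive the object that first held the text.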
I suspect views and ranges are going to be increasingly important in modern C++ with the upcoming and recent improvements to the language.
boost::string_ref is apparently an implementation of a proposal to the C++1y standard.
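For readers on a newer toolchain: std::string_view was standardized in C++17 (header <string_view>), so a hand-written class is only needed with older compilers. A small sketch, assuming a C++17 compiler; split_words is a hypothetical helper that splits on single spaces only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Splits text on space characters and returns views into text.
// No characters are copied; the buffer behind text must outlive the result.
std::vector<std::string_view> split_words(std::string_view text) {
    std::vector<std::string_view> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t next = text.find(' ', pos);
        if (next == std::string_view::npos) next = text.size();
        assert(next >= pos);  // find never reports a match before pos
        if (next > pos) words.push_back(text.substr(pos, next - pos));
        pos = next + 1;  // skip the separator
    }
    return words;
}
```

A std::string converts implicitly to std::string_view, so split_words(source) works directly on the loaded text, with the same lifetime caveat as any non-owning view.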
[1] However, given how simple it is to add capabilities in template metaprogramming, having a "resource owner" template argument to a view type might be a good design decision. Then you can have owning and non-owning string_views with otherwise identical semantics...
Some thoughts here:
-The internal representation of a string lives exactly as long as the string itself. If you save pointers or iterators into the string for later use (e.g., printing reports, postprocessing) beyond the scope of the string, you will face invalid memory accesses. Normally in this type of processing the text lives for the whole duration of the process.
-Iterators are a good choice (for extreme performance and generality I suggest using a raw const pointer, const char*, because the origin could be almost anything: a string, a buffer, a mapped buffer, data read from a stream, etc.).
-A good practice is, instead of copying the tokens, to save a pair (token begin iterator, token end iterator) in a collection of tokens.
-It is imperative for performance to avoid making a lot of allocations (allocation is one of the most expensive operations in any language).
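The points above can be sketched as a whitespace lexer that stores each token as a pair of const char* into the source text. Token and tokenize are hypothetical names, and splitting on whitespace stands in for whatever rules the real lexer uses.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// A token is a half-open range [begin, end) into the source text.
// It owns nothing, so the source buffer must outlive all tokens.
struct Token {
    const char* begin;
    const char* end;

    // Copies the token text only when an owning string is really needed.
    std::string text() const { return std::string(begin, end); }
};

// Splits [first, last) into whitespace-separated tokens.
// The only allocations are those made by the vector itself.
std::vector<Token> tokenize(const char* first, const char* last) {
    assert(first <= last);
    std::vector<Token> tokens;
    const char* p = first;
    while (p != last) {
        // Skip leading whitespace.
        while (p != last && std::isspace(static_cast<unsigned char>(*p))) ++p;
        const char* start = p;
        // Consume the token body.
        while (p != last && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (start != p) tokens.push_back(Token{start, p});
    }
    return tokens;
}
```

Because tokenize takes raw pointers, the same function works on a std::string (via data() and size()), a memory-mapped file, or any other contiguous buffer.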
You could check lexertl (for more ideas, or to use it directly): http://www.benhanson.net/lexertl.html and spirit (more complete): http://www.boost.org/doc/libs/release/libs/spirit/
Returning and using iterators is not a bad practice. Of course assuming that you are not modifying the input buffer, but it does not look like you are.