
In C++11 what is the most performant way to return a reference/pointer to a position in a std::string?

I'm building a text parser that uses std::string as the core storage for strings.

I know this is not optimal and that parsers inside compilers use optimized approaches for this. In my project I don't mind losing some performance in exchange for more clarity and easier maintenance.

At the beginning I read a huge text into memory and then I scan each character to build an ordered set of tokens; it's a simple lexer. Currently I'm using std::string to represent the text of a token, but I would like to improve this a bit by using a reference/pointer into the original text.

From what I have read, it is bad practice to return and hold on to iterators, and it is also bad practice to refer to the std::string internal buffer.

Any suggestions on how to accomplish this in a "clean" way?

Pedro Salgueiro asked Jul 15 '14 15:07




3 Answers

There are proposals to add string_view to C++ in an upcoming standard.

A string_view is a non-owning iterable range over characters with many of the utilities and properties you'd expect of a string class, except you cannot insert/delete characters (and editing characters is often blocked in some subtypes).

I would advise trying that approach -- write your own (in your own utility namespace). (You should have your own utility namespace for reusable code snippets anyhow).

The core data is a pair of char* or std::string::iterators (or const versions). If the user needs a null-terminated buffer, a to_string method allocates one. I would start with non-mutable (const) character data. Do not forget begin and end: that makes your view iterable with for(:) loops.

This design has the danger that the original std::string has to persist long enough to outlast all of the views.
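A minimal sketch of such a view, compilable as C++11. The namespace `util`, the class name, and the `substr` and `operator==` helpers are illustrative assumptions, not the interface of the proposed standard type; only the pointer pair, `to_string`, and `begin`/`end` come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

namespace util {  // hypothetical utility namespace

// Non-owning, read-only view over a contiguous run of characters.
// The viewed buffer must outlive every view that refers to it.
class string_view {
public:
    string_view() : first_(nullptr), last_(nullptr) {}
    string_view(const char* first, const char* last)
        : first_(first), last_(last) {
        assert(first <= last);
    }
    // View over a whole std::string (no copy is made).
    explicit string_view(const std::string& s)
        : first_(s.data()), last_(s.data() + s.size()) {}

    // begin/end make the view usable in range-based for loops.
    const char* begin() const { return first_; }
    const char* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    bool empty() const { return first_ == last_; }
    char operator[](std::size_t i) const { return first_[i]; }

    // Sub-view of at most `count` characters starting at `pos`; still no copy.
    string_view substr(std::size_t pos, std::size_t count) const {
        assert(pos <= size());
        const std::size_t n = count < size() - pos ? count : size() - pos;
        return string_view(first_ + pos, first_ + pos + n);
    }

    // Allocates: use only when an owning, null-terminated string is needed.
    std::string to_string() const { return std::string(first_, last_); }

private:
    const char* first_;
    const char* last_;
};

// Two views are equal when they cover the same characters.
inline bool operator==(string_view a, string_view b) {
    return a.size() == b.size() &&
           (a.size() == 0 || std::memcmp(a.begin(), b.begin(), a.size()) == 0);
}

}  // namespace util
```

With a `std::string text` that stays alive and unmodified, `util::string_view(text).substr(4, 1)` gives a token that points into `text` without allocating. The explicit constructor is deliberate: binding a view to a temporary std::string would leave it dangling immediately.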

If you are willing to give up some performance for safety, have the view own a std::shared_ptr<const std::string> that it can move a std::string into. As a first step, move the entire buffer into it, and then start chopping/parsing it down. (Child views make a new shared pointer to the same data.) Then your view class is more like a non-mutable string with shared storage.

The upsides to the shared_ptr<const> version include safety, longer lifetime of the views (there is no more lifetime dependency), and you can easily forward your const "substring" type methods to the std::string so you can write less code.

Downsides include possible incompatibility with the incoming standard one [1], and lower performance because you are dragging a shared_ptr around.
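The shared-ownership variant could look roughly like this. It is a sketch under assumptions: the class name `shared_string_view`, the offset/length representation, and the `use_count` accessor are made up for illustration, and it compiles as C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace util {  // hypothetical utility namespace

// Read-only view that shares ownership of the underlying text, so the
// buffer stays alive for as long as any view into it exists.
class shared_string_view {
public:
    // Takes the text by value so callers can std::move a large buffer in.
    explicit shared_string_view(std::string text)
        : owner_(std::make_shared<std::string>(std::move(text))),
          pos_(0), len_(owner_->size()) {}

    // Child view: copies the shared_ptr (a reference-count bump), not the text.
    shared_string_view substr(std::size_t pos, std::size_t count) const {
        assert(pos <= len_);
        const std::size_t n = count < len_ - pos ? count : len_ - pos;
        return shared_string_view(owner_, pos_ + pos, n);
    }

    const char* begin() const { return owner_->data() + pos_; }
    const char* end() const { return begin() + len_; }
    std::size_t size() const { return len_; }

    // Forwarded to std::string::substr, which performs the copy.
    std::string to_string() const { return owner_->substr(pos_, len_); }

    // Number of views currently sharing the buffer (for illustration only).
    long use_count() const { return owner_.use_count(); }

private:
    shared_string_view(std::shared_ptr<const std::string> owner,
                       std::size_t pos, std::size_t len)
        : owner_(std::move(owner)), pos_(pos), len_(len) {}

    std::shared_ptr<const std::string> owner_;  // keeps the text alive
    std::size_t pos_;                           // start offset into *owner_
    std::size_t len_;                           // number of viewed characters
};

}  // namespace util
```

Here a token obtained from `shared_string_view(std::move(big)).substr(6, 4)` remains valid after the root view is gone, at the cost of one atomic reference-count update per view copy and a larger view object.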

I suspect views and ranges are going to be increasingly important in modern C++ with the upcoming and recent improvements to the language.

boost::string_ref is apparently an implementation of a proposal to the C++1y standard.


[1] However, given how simple it is to add capabilities in template metaprogramming, having a "resource owner" template argument to a view type might be a good design decision. Then you can have owning and non-owning string_views with otherwise identical semantics...

Yakk - Adam Nevraumont answered Nov 10 '22 00:11


Some thoughts here:

- The string's internal representation lives exactly as long as the string itself. If you save pointers or iterators into the string and use them later (e.g. printing reports, post-processing), outside the string's scope, you will get invalid memory accesses. Normally, in this type of processing, the text lives for the whole duration of the process.
- Iterators are a good choice (for extreme performance and generality I suggest using a const raw pointer, const char*, because the source could be almost anything: a string, a buffer, a mapped buffer, data read from a stream, etc.).
- A good practice is, instead of copying the tokens, to save a pair (token begin iterator, token end iterator) in a collection of tokens.
- For performance it is imperative to avoid making a lot of allocations (allocation is one of the most expensive operations in any language).
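The points above can be sketched as a toy whitespace lexer that stores each token as a pair of const char* into the source text. `Token` and `tokenize` are hypothetical names, and the divide-by-four reserve heuristic is an arbitrary guess; a real lexer would also classify its tokens.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A token is just a [first, last) range into the original text.
struct Token {
    const char* first;
    const char* last;
    // Copies the characters only when an owning string is really needed.
    std::string text() const { return std::string(first, last); }
};

// Splits [first, last) on whitespace without copying any characters.
std::vector<Token> tokenize(const char* first, const char* last) {
    assert(first <= last);
    std::vector<Token> tokens;
    // Rough guess at the token count, to limit vector reallocations.
    tokens.reserve(static_cast<std::size_t>(last - first) / 4);
    const char* p = first;
    while (p != last) {
        // Skip leading whitespace.
        while (p != last && std::isspace(static_cast<unsigned char>(*p))) ++p;
        const char* start = p;
        // Consume one run of non-whitespace characters.
        while (p != last && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (start != p) tokens.push_back(Token{start, p});
    }
    return tokens;
}

// Convenience overload; `text` must outlive the returned tokens.
inline std::vector<Token> tokenize(const std::string& text) {
    return tokenize(text.data(), text.data() + text.size());
}
```

Because the pointer-based overload accepts any contiguous buffer, the same code works for a std::string, a memory-mapped file, or a plain char array. The only allocation in the loop is the occasional growth of the token vector.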

You could check out lexertl (for more ideas, or to use it directly): http://www.benhanson.net/lexertl.html and Spirit (more complete): http://www.boost.org/doc/libs/release/libs/spirit/

NetVipeC answered Nov 10 '22 01:11


Returning and using iterators is not a bad practice. This of course assumes that you are not modifying the input buffer, but it does not look like you are.
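As a small illustration, a function can return a pair of std::string::const_iterator that stays valid as long as the string is neither modified nor destroyed. `Iter` and `next_word` are hypothetical names, and splitting on a single space character is a simplification.

```cpp
#include <cassert>
#include <string>
#include <utility>

typedef std::string::const_iterator Iter;

// Returns the [first, last) range of the next run of non-space characters
// at or after `from`. Both iterators equal text.end() when nothing is left.
std::pair<Iter, Iter> next_word(const std::string& text, Iter from) {
    assert(text.begin() <= from && from <= text.end());
    Iter first = from;
    while (first != text.end() && *first == ' ') ++first;  // skip spaces
    Iter last = first;
    while (last != text.end() && *last != ' ') ++last;     // consume the word
    return std::make_pair(first, last);
}
```

Taking the text by const reference documents the assumption in the signature: the function cannot modify the buffer, so it cannot invalidate the iterators it hands back.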

Wojtek Surowka answered Nov 10 '22 00:11