Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ replace multiple strings in a string in a single pass

Given the following string, "Hi ~+ and ^*. Is ^* still flying around ~+?"

I want to replace all occurrences of "~+" and "^*" with "Bobby" and "Danny", so the string becomes:

"Hi Bobby and Danny. Is Danny still flying around Bobby?"

I would prefer not to have to call Boost replace function twice to replace the occurrences of the two different values.

like image 307
John Goodson Avatar asked Oct 08 '10 23:10

John Goodson


People also ask

How do I replace multiples in a string?

replace(/cat/gi, "dog"); // now str = "I have a dog, a dog, and a goat." str = str. replace(/dog/gi, "goat"); // now str = "I have a goat, a goat, and a goat." str = str. replace(/goat/gi, "cat"); // now str = "I have a cat, a cat, and a cat."

How do you replace multiple values?

Find and replace multiple values with nested SUBSTITUTE The easiest way to find and replace multiple entries in Excel is by using the SUBSTITUTE function. The formula's logic is very simple: you write a few individual functions to replace an old value with a new one.

What can you replace couple of characters in a string with?

01) Using replace() method The replace() method takes two parameters as input, the first is the pattern you want (to match in the string) to be replaced and the second parameter is the pattern (character) you want to replace with.

How do you replace values in a string?

The replace() method searches a string for a value or a regular expression. The replace() method returns a new string with the value(s) replaced. The replace() method does not change the original string.


3 Answers

I managed to implement the required replacement function using Boost.Iostreams. Specifically, the method I used was a filtering stream using regular expression to match what to replace. I am not sure about the performance on gigabyte sized files. You will need to test it of course. Anyway, here's the code:

#include <boost/regex.hpp>
#include <boost/iostreams/filter/regex.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <iostream>

int main()
{
   using namespace boost::iostreams;

   regex_filter filter1(boost::regex("~\\+"), "Bobby");
   regex_filter filter2(boost::regex("\\^\\*"), "Danny");

   filtering_ostream out;
   out.push(filter1);
   out.push(filter2);
   out.push(std::cout);

   out << "Hi ~+ and ^*. Is ^* still flying around ~+?" << std::endl;

   // for file conversion, use this line instead:
   //out << std::cin.rdbuf();
}

The above prints "Hi Bobby and Danny. Is Danny still flying around Bobby?" when run, just like expected.

It would be interesting to see the performance results, if you decide to measure it.

Daniel

Edit: I just realized that regex_filter needs to read the entire character sequence into memory, making it pretty useless for gigabyte-sized inputs. Oh well...

like image 170
Daniel Lidström Avatar answered Oct 14 '22 01:10

Daniel Lidström


I did notice it's been a year since this was active, but for what it's worth. I came across an article on CodeProject today that claims to solve this problem - maybe you can use ideas from there:

I can't vouch for its correctness, but might be worth taking a look at. :)

The implementation surely requires holding the entire string in memory, but you can easily work around that (as with any other implementation that performs the replacements) as long as you can split the input into blocks and guarantee that you never split at a position that is inside a symbol to be replaced. (One easy way to do that in your case is to split at a position where the next char isn't any of the chars used in a symbol.)

--

There is a reason beyond performance (though that is a sufficient reason in my book) to add a "ReplaceMultiple" method to one's string library: Simply doing the replace operation N times is NOT correct in general.

If the values that are substituted for the symbols are not constrained, values can end up being treated as symbols in subsequent replace operations. (There could be situations where you'd actually want this, but there are definitely cases where you don't. Using strange-looking symbols reduces the severity of the problem, but doesn't solve it, and "is ugly" because the strings to be formatted may be user-defineable - and so should not require exotic characters.)

However, I suspect there is a good reason why I can't easily find a general multi-replace implementation. A "ReplaceMultiple" operation simply isn't (obviously) well-defined in general.

To see this, consider what it might mean to "replace 'aa' with '!' and 'baa' with '?' in the string 'abaa'"? Is the result 'ab!' or 'a?' - or is such a replacement illegal?

One could require symbols to be "prefix-free", but in many cases that'd be unacceptable. Say I want to use this to format some template text. And say my template is for code. I want to replace "§table" with a database table name known only at runtime. It'd be annoying if I now couldn't use "§t" in the same template. The templated script could be something completely generic, and lo-and-behold, one day I encounter the client that actually made use of "§" in his table names... potentially making my template library rather less useful.

A perhaps better solution would be to use a recursive-descent parser instead of simply replacing literals. :)

like image 34
The Dag Avatar answered Oct 14 '22 01:10

The Dag


Very late answer but none of answers so far give a solution.

With a bit of Boost Spirit Qi you can do this substitution in one pass, with extremely high efficiency.

#include <iostream>
#include <string>
#include <string_view>
#include <map>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>

namespace bsq = boost::spirit::qi;

using SUBSTITUTION_MAP = std::map<std::string, std::string>;//,std::string>;

template <typename InputIterator>
struct replace_grammar
    : bsq::grammar<InputIterator, std::string()>
{
    replace_grammar(const SUBSTITUTION_MAP& substitution_items)
        : replace_grammar::base_type(main_rule)
        {
            for(const auto& [key, value] : substitution_items) {
                replace_items.add(key,value);
            }
            
            main_rule = *( replace_items [( [](const auto &val,  auto& context) {
                auto& res = boost::fusion::at_c<0>(context.attributes);
                res += val;  })]
                |
                bsq::char_ 
                [( [](const auto &val,  auto& context) {
                    auto& res = boost::fusion::at_c<0>(context.attributes);
                    res += val;  })] );
        }

private :
    bsq::symbols<char, std::string> replace_items;
    bsq::rule<InputIterator, std::string()> main_rule;

};


std::string replace_items(std::string_view input, const SUBSTITUTION_MAP& substitution_items)
{
    std::string result;
    result.reserve(input.size());
  
    using iterator_type = std::string_view::const_iterator;
    
    const replace_grammar<iterator_type> p(substitution_items);

    if (!bsq::parse(input.begin(), input.end(), p, result))
        throw std::logic_error("should not happen");

    return result;

}

int main()
{
    std::cout << replace_items("Hi ~+ and ^*. Is ^* still flying around ~+?",{{"~+", "Bobby"} , { "^*", "Danny"}});
}

The qi::symbol is essentially doing the job you ask for , i.e searching the given keys and replace with the given values. https://www.boost.org/doc/libs/1_79_0/libs/spirit/doc/html/spirit/qi/reference/string/symbols.html

As said in the doc it builds behind the scene a Ternary Search Tree, which means that it is more efficient that searching n times the string for each key.

like image 34
sandwood Avatar answered Oct 14 '22 00:10

sandwood