
Simple JSON string escape for C++?

Tags: c++, json

I have a very simple program that outputs a simple JSON string, which I concatenate manually and write to the std::cout stream (the output really is that simple). However, some of my strings may contain double quotes, curly braces, and other characters that could break the JSON string. So I need a library (or, more accurately, a function) that escapes strings according to the JSON standard, as lightweight as possible, nothing more, nothing less.

I found a few libraries that encode whole objects into JSON, but given that my program is a single 900-line .cpp file, I would rather not rely on a library several times bigger than my program just to achieve something as simple as this.

ddinchev asked Oct 11 '11 10:10



4 Answers

Caveat

Whatever solution you choose, keep in mind that the JSON standard requires you to escape all control characters. Many developers get this wrong and assume that only a handful of them need escaping.

All control characters means everything from '\x00' to '\x1f', not just those with a short representation such as '\x0a' (also known as '\n'). For example, you must escape the '\x02' character as \u0002.

See also: ECMA-404 - The JSON data interchange syntax, 2nd edition, December 2017, Page 4

Simple solution

If you know for sure that your input string is UTF-8 encoded, you can keep things simple.

Since JSON allows you to escape everything via \uXXXX, even " and \, a simple solution is:

#include <sstream>
#include <iomanip>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        if (*c == '"' || *c == '\\' || ('\x00' <= *c && *c <= '\x1f')) {
            o << "\\u"
              << std::hex << std::setw(4) << std::setfill('0') << static_cast<int>(*c);
        } else {
            o << *c;
        }
    }
    return o.str();
}
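
Since the question concatenates JSON by hand, the escaped text still has to be wrapped in double quotes before it is spliced in. The following is a minimal sketch of that step, assuming UTF-8 input as above. The name json_quote is hypothetical, and it uses std::snprintf rather than a string stream, which is a swapped-in technique and not the code from this answer:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the answer above): escapes s using only
// the six-character hex form and wraps the result in double quotes, so it
// can be concatenated directly into hand-built JSON.
std::string json_quote(const std::string &s) {
    std::string out = "\"";
    for (unsigned char c : s) {
        if (c == '"' || c == '\\' || c <= 0x1f) {
            char buf[7];  // backslash, 'u', four hex digits, terminating NUL
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else {
            // Everything else, including UTF-8 multi-byte sequences, is
            // copied through unchanged.
            out += static_cast<char>(c);
        }
    }
    out += '"';
    return out;
}
```

A call such as std::cout << "{\"name\": " << json_quote(name) << "}"; would then produce a complete object, assuming name is a std::string holding UTF-8 text.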

Shortest representation

For the shortest representation you may use JSON shortcuts, such as \" instead of \u0022. The following function produces the shortest JSON representation of a UTF-8 encoded string s:

#include <sstream>
#include <iomanip>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        switch (*c) {
        case '"': o << "\\\""; break;
        case '\\': o << "\\\\"; break;
        case '\b': o << "\\b"; break;
        case '\f': o << "\\f"; break;
        case '\n': o << "\\n"; break;
        case '\r': o << "\\r"; break;
        case '\t': o << "\\t"; break;
        default:
            if ('\x00' <= *c && *c <= '\x1f') {
                o << "\\u"
                  << std::hex << std::setw(4) << std::setfill('0') << static_cast<int>(*c);
            } else {
                o << *c;
            }
        }
    }
    return o.str();
}
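
If you would rather write straight to std::cout than build a temporary string, note that std::hex and std::setfill are sticky and would change how later numbers are printed. A possible sketch that saves and restores the stream state is shown below; write_json_string is a hypothetical name, and adding the surrounding quotes is an assumption about how the result will be used:

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <string>

// Hypothetical variant: writes s to o as a quoted JSON string and leaves
// the stream's formatting flags and fill character as it found them.
void write_json_string(std::ostream &o, const std::string &s) {
    const std::ios_base::fmtflags saved_flags = o.flags();
    const char saved_fill = o.fill();
    o << '"';
    for (unsigned char c : s) {
        switch (c) {
        case '"':  o << "\\\""; break;
        case '\\': o << "\\\\"; break;
        case '\b': o << "\\b"; break;
        case '\f': o << "\\f"; break;
        case '\n': o << "\\n"; break;
        case '\r': o << "\\r"; break;
        case '\t': o << "\\t"; break;
        default:
            if (c <= 0x1f) {
                // Remaining control characters use the four-digit hex form.
                o << "\\u" << std::hex << std::setw(4) << std::setfill('0')
                  << static_cast<int>(c);
            } else {
                o << static_cast<char>(c);
            }
        }
    }
    o << '"';
    o.flags(saved_flags);  // undo std::hex
    o.fill(saved_fill);    // undo std::setfill('0')
}
```

Without the two restore calls, an integer written to the same stream afterwards would come out in hexadecimal.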

Pure switch statement

It is also possible to get along with a pure switch statement, that is, without if and <iomanip>. While this is quite cumbersome, it may be preferable from a "security by simplicity and purity" point of view:

#include <sstream>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        switch (*c) {
        case '\x00': o << "\\u0000"; break;
        case '\x01': o << "\\u0001"; break;
        ...
        case '\x0a': o << "\\n"; break;
        ...
        case '\x1f': o << "\\u001f"; break;
        case '\x22': o << "\\\""; break;
        case '\x5c': o << "\\\\"; break;
        default: o << *c;
        }
    }
    return o.str();
}

Using a library

You might want to have a look at https://github.com/nlohmann/json, which is an efficient header-only C++ library (MIT License) that seems to be very well-tested.

You can either call their escape_string() method directly (Note that this is a bit tricky, see comment below by Lukas Salich), or you can take their implementation of escape_string() as a starting point for your own implementation:

https://github.com/nlohmann/json/blob/ec7a1d834773f9fee90d8ae908a0c9933c5646fc/src/json.hpp#L4604-L4697

vog answered Oct 10 '22 07:10


I have written simple JSON escape and unescape functions. The code is publicly available on GitHub. For anyone interested, here is the code:

#include <string>

enum State {ESCAPED, UNESCAPED};

std::string escapeJSON(const std::string& input)
{
    std::string output;
    output.reserve(input.length());

    for (std::string::size_type i = 0; i < input.length(); ++i)
    {
        switch (input[i]) {
            case '"':
                output += "\\\"";
                break;
            case '/':
                output += "\\/";
                break;
            case '\b':
                output += "\\b";
                break;
            case '\f':
                output += "\\f";
                break;
            case '\n':
                output += "\\n";
                break;
            case '\r':
                output += "\\r";
                break;
            case '\t':
                output += "\\t";
                break;
            case '\\':
                output += "\\\\";
                break;
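            default:
                // JSON requires every control character below 0x20 to be
                // escaped; those without a short form get four hex digits.
                if (static_cast<unsigned char>(input[i]) < 0x20) {
                    static const char hex[] = "0123456789abcdef";
                    output += "\\u00";
                    output += hex[(input[i] >> 4) & 0x0f];
                    output += hex[input[i] & 0x0f];
                } else {
                    output += input[i];
                }
                break;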
        }

    }

    return output;
}

std::string unescapeJSON(const std::string& input)
{
    State s = UNESCAPED;
    std::string output;
    output.reserve(input.length());

    for (std::string::size_type i = 0; i < input.length(); ++i)
    {
        switch(s)
        {
            case ESCAPED:
                {
                    switch(input[i])
                    {
                        case '"':
                            output += '\"';
                            break;
                        case '/':
                            output += '/';
                            break;
                        case 'b':
                            output += '\b';
                            break;
                        case 'f':
                            output += '\f';
                            break;
                        case 'n':
                            output += '\n';
                            break;
                        case 'r':
                            output += '\r';
                            break;
                        case 't':
                            output += '\t';
                            break;
                        case '\\':
                            output += '\\';
                            break;
                        default:
                            output += input[i];
                            break;
                    }

                    s = UNESCAPED;
                    break;
                }
            case UNESCAPED:
                {
                    switch(input[i])
                    {
                        case '\\':
                            s = ESCAPED;
                            break;
                        default:
                            output += input[i];
                            break;
                    }
                }
        }
    }
    return output;
}
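
Note that unescapeJSON above simply copies whatever character follows a backslash, so a \uXXXX sequence comes out as the literal text uXXXX. Decoding it needs two extra steps: reading the four hex digits and emitting the code point as UTF-8. The following is a rough sketch of those two steps only. Both function names are hypothetical, surrogate pairs (code points above U+FFFF) are deliberately not handled, and wiring this into the state machine is left out:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends code point cp to out as UTF-8. Assumes cp is
// at most 0xFFFF and is not half of a surrogate pair.
void append_utf8(std::string &out, unsigned cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: reads the four hex digits at input[pos .. pos+3],
// i.e. the part right after the backslash and the 'u'. Returns false if
// fewer than four characters remain or any of them is not a hex digit.
bool parse_hex4(const std::string &input, std::size_t pos, unsigned &cp) {
    if (pos + 4 > input.size()) return false;
    cp = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char ch = input[i];
        unsigned digit;
        if (ch >= '0' && ch <= '9') digit = ch - '0';
        else if (ch >= 'a' && ch <= 'f') digit = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F') digit = ch - 'A' + 10;
        else return false;
        cp = cp * 16 + digit;
    }
    return true;
}
```

In the ESCAPED state, a case 'u' branch could call parse_hex4 with i + 1, pass the result to append_utf8, and advance i by four on success.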
mariolpantunes answered Oct 10 '22 07:10


To build on vog's answer: the following JavaScript snippet generates C++ code containing a full lookup table for characters 0 to 92 (null to backslash).

// generate full jump table for c++ json string escape
// license is public domain or CC0-1.0
//var s = require('fs').readFileSync('case-list.txt', 'utf8');
var s = ` // escape hell...
        case '"': o << "\\\\\\""; break;
        case '\\\\': o << "\\\\\\\\"; break;
        case '\\b': o << "\\\\b"; break;
        case '\\f': o << "\\\\f"; break;
        case '\\n': o << "\\\\n"; break;
        case '\\r': o << "\\\\r"; break;
        case '\\t': o << "\\\\t"; break;
`;
const charMap = new Map();
s.replace(/case\s+'(.*?)':\s+o\s+<<\s+"(.*?)";\s+break;/g, (...args) => {
  const [, charEsc, replaceEsc ] = args;
  const char = eval(`'${charEsc}'`);
  const replace = eval(`'${replaceEsc}'`);
  //console.dir({ char, replace, });
  charMap.set(char, replace);
});
const iMax = Math.max(
  0x1f, // 31. 0 to 31: control characters
  '"'.charCodeAt(0), // 34
  '\\'.charCodeAt(0), // 92
);
const replace_function_name = 'String_showAsJson';
const replace_array_name = replace_function_name + '_replace_array';
// longest replace (\u0000) has 6 chars + 1 null byte = 7 byte
var res = `\
// ${iMax + 1} * 7 = ${(iMax + 1) * 7} byte / 4096 page = ${Math.round((iMax + 1) * 7 / 4096 * 100)}%
char ${replace_array_name}[${iMax + 1}][7] = {`;
res += '\n  ';
let i, lastEven;
for (i = 0; i <= iMax; i++) {
  const char = String.fromCharCode(i);
  const replace = charMap.has(char) ? charMap.get(char) :
    (i <= 0x1f) ? '\\u' + i.toString(16).padStart(4, 0) :
    char // no replace
  ;
  const hex = '0x' + i.toString(16).padStart(2, 0);
  //res += `case ${hex}: o << ${JSON.stringify(replace)}; break; /`+`/ ${i}\n`;
  //if (i > 0) res += ',';
  //res += `\n  ${JSON.stringify(replace)}, // ${i}`;
  if (i > 0 && i % 5 == 0) {
    res += `// ${i - 5} - ${i - 1}\n  `;
    lastEven = i;
  }
  res += `${JSON.stringify(replace)}, `;
}
res += `// ${lastEven} - ${i - 1}`;
res += `\n};

void ${replace_function_name}(std::ostream & o, const std::string & s) {
  for (auto c = s.cbegin(); c != s.cend(); c++) {
    if ((std::uint8_t) *c <= ${iMax})
      o << ${replace_array_name}[(std::uint8_t) *c];
    else
      o << *c;
  }
}
`;

//console.log(res);
document.querySelector('#res').innerHTML = res;
<pre id="res"></pre>
Mila Nautikus answered Oct 10 '22 07:10


You didn't say exactly where those strings you're cobbling together are coming from, originally, so this may not be of any use. But if they all happen to live in the code, as @isnullxbh mentioned in this comment to an answer on a different question, another option is to leverage a lovely C++11 feature: Raw string literals.

I won't quote cppreference's long-winded, standards-based explanation, you can read it yourself there. Basically, though, R-strings bring to C++ the same sort of programmer-delimited literals, with absolutely no restrictions on content, that you get from here-docs in the shell, and which languages like Perl use so effectively. (Prefixed quoting using curly braces may be Perl's single greatest invention:)

my $qstring = q{Quoted 'string'!};
my $qqstring = qq{Double "quoted" 'string'!};
my $replacedstring = q{Regexps that /totally/! get eaten by your parser.};
$replacedstring =~ s{/totally/!}{(won't!)};
# Heh. I see the syntax highlighter isn't quite up to the challenge, though.

In C++11 or later, a raw string literal is prefixed with a capital R before the double quotes, and inside the quotes the string is preceded by an optional delimiter of your choosing (up to 16 characters, excluding spaces, parentheses, and backslashes) followed by an opening paren.

From there on, you can safely write literally anything other than a closing paren followed by your chosen delimiter and a double quote. That sequence terminates the raw literal, and then you have a string literal whose contents you can confidently trust the compiler to take verbatim, with no escape processing.
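
As a small illustration of the syntax just described, the sketch below writes the same JSON text twice, once as a conventional literal and once as a raw literal with the delimiter json. The variable names are made up for the example. Note that the raw form only removes the C++ layer of escaping: the backslashes that JSON itself requires inside its strings still have to be there.

```cpp
#include <string>

// Conventional literal: every double quote and backslash needs a C++ escape.
const std::string escaped =
    "{\"path\": \"C:\\\\temp\", \"note\": \"say \\\"hi\\\"\"}";

// Raw literal: the text between R"json( and )json" is taken verbatim.
const std::string raw =
    R"json({"path": "C:\\temp", "note": "say \"hi\""})json";
```

Both variables hold exactly the same characters, so raw == escaped is true.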

"Raw"-ness is not lost in subsequent manipulations, either. So, borrowing from the chapter list for Crockford's How JavaScript Works, this is completely valid:

std::string ch0_to_4 = R"json(
[
    {"number": 0, "chapter": "Read Me First!"},
    {"number": 1, "chapter": "How Names Work"},
    {"number": 2, "chapter": "How Numbers Work"},
    {"number": 3, "chapter": "How Big Integers Work"},
    {"number": 4, "chapter": "How Big Floating Point Works"},)json";

std::string ch5_and_6 = R"json(
    {"number": 5, "chapter": "How Big Rationals Work"},
    {"number": 6, "chapter": "How Booleans Work"})json";

std::string chapters = ch0_to_4 + ch5_and_6 + "\n]";
std::cout << chapters;

The string 'chapters' will emerge from std::cout completely intact (preceded by the newline that opens the first raw literal):

[
    {"number": 0, "chapter": "Read Me First!"},
    {"number": 1, "chapter": "How Names Work"},
    {"number": 2, "chapter": "How Numbers Work"},
    {"number": 3, "chapter": "How Big Integers Work"},
    {"number": 4, "chapter": "How Big Floating Point Works"},
    {"number": 5, "chapter": "How Big Rationals Work"},
    {"number": 6, "chapter": "How Booleans Work"}
]
FeRD answered Oct 10 '22 06:10