To encode strings in JSON, several reserved characters need to be escaped with a backslash, and each string needs to be wrapped in double quotes. Currently the jsonlite package implements this using the deparse function in base R:
deparse_vector <- function(x) {
  stopifnot(is.character(x))
  vapply(x, deparse, character(1), USE.NAMES=FALSE)
}
This does the trick:
test <- c("line\nline", "foo\\bar", "I said: \"hi!\"")
cat(deparse_vector(test))
However, deparse turns out to be slow for large vectors. An alternative implementation is to gsub each character individually:
deparse_vector2 <- function(x) {
  stopifnot(is.character(x))
  if(!length(x)) return(x)
  x <- gsub("\\", "\\\\", x, fixed=TRUE)
  x <- gsub("\"", "\\\"", x, fixed=TRUE)
  x <- gsub("\n", "\\n", x, fixed=TRUE)
  x <- gsub("\r", "\\r", x, fixed=TRUE)
  x <- gsub("\t", "\\t", x, fixed=TRUE)
  x <- gsub("\b", "\\b", x, fixed=TRUE)
  x <- gsub("\f", "\\f", x, fixed=TRUE)
  paste0("\"", x, "\"")
}
This is somewhat faster, but not by much, and it is a bit ugly too. What would be a better way to do this (preferably without additional dependencies)?
This script can be used to compare the implementations:
> system.time(out1 <- deparse_vector1(strings))
   user  system elapsed
  6.517   0.000   6.523
> system.time(out2 <- deparse_vector2(strings))
   user  system elapsed
  1.194   0.000   1.194
Here's a C++ version of Winston's code. It's quite a lot simpler because you can efficiently grow std::strings. It's also less likely to crash because Rcpp takes care of memory management for you.
#include <Rcpp.h>
using namespace Rcpp;

// Escape a single string and wrap it in double quotes.
// [[Rcpp::export]]
std::string escape_one(std::string x) {
    std::string out = "\"";
    int n = x.size();
    for (int i = 0; i < n; ++i) {
        char cur = x[i];
        switch(cur) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            case '\b': out += "\\b";  break;
            case '\f': out += "\\f";  break;
            default:   out += cur;  // every other byte is copied unchanged
        }
    }
    out += '"';
    return out;
}

// Apply escape_one to each element of a character vector.
// [[Rcpp::export]]
CharacterVector escape_chars(CharacterVector x) {
    int n = x.size();
    CharacterVector out(n);
    for (int i = 0; i < n; ++i) {
        String cur = x[i];
        out[i] = escape_one(cur);
    }
    return out;
}
On your benchmark, deparse_vector2(strings) takes 0.8s, and escape_chars(strings) takes 0.165s.
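One thing the versions above do not cover: the JSON grammar (RFC 8259) requires every control character from U+0000 through U+001F to be escaped, not only the ones with short names like \n and \t. Below is a minimal standalone sketch (plain C++, no Rcpp) of how the same switch could be extended to emit \u00XX for the rest. The names escape_json and append_unicode_escape are made up for this illustration and are not part of jsonlite or Rcpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): append the JSON \u00XX escape
// for a control character below 0x20.
static void append_unicode_escape(std::string& out, unsigned char c) {
    assert(c < 0x20);  // only meant for ASCII control characters
    static const char hex[] = "0123456789abcdef";
    out += "\\u00";
    out += hex[c >> 4];
    out += hex[c & 0x0F];
}

// Escape one string for JSON and wrap it in double quotes.
std::string escape_json(const std::string& x) {
    std::string out;
    out.reserve(x.size() + 2);  // at least the input plus the two quotes
    out += '"';
    for (unsigned char c : x) {
        switch (c) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            case '\b': out += "\\b";  break;
            case '\f': out += "\\f";  break;
            default:
                if (c < 0x20) append_unicode_escape(out, c);
                else out += static_cast<char>(c);  // copied unchanged
        }
    }
    out += '"';
    return out;
}
```

The reserve call is only a lower bound on the output size; it avoids some reallocations but is not needed for correctness. I have not benchmarked this variant against escape_chars.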
I don't know of a faster way to do this with just R code, but I did decide to try my hand at implementing it in C, wrapped in an R function called deparse_vector3. It's rough (and I'm far from an expert C programmer) but it seems to work for your examples: https://gist.github.com/wch/e3ec5b20eb712f1b22b2
On my system (Mac, R 3.1.1), deparse_vector2 is over 20x faster than deparse_vector, which is a much bigger difference than the 5x you got in your test.
My deparse_vector3 function is just 3x faster than deparse_vector2. There's probably room for improvement.
> system.time(out1 <- deparse_vector1(strings))
   user  system elapsed
  8.459   0.009   8.470
> system.time(out2 <- deparse_vector2(strings))
   user  system elapsed
  0.368   0.007   0.374
> system.time(out3 <- deparse_vector3(strings))
   user  system elapsed
  0.120   0.001   0.120
I don't think this will correctly handle non-ASCII character encodings, though. Here's an example of how encodings are handled in the R source: https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/grep.c#L588-L630
Edit: This seems to handle UTF-8 OK, though it's possible I'm missing something in my testing.
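A likely reason byte-by-byte escaping is safe for UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), so it can never be mistaken for an ASCII backslash, quote, or control character. That does not hold for every encoding. In Shift-JIS, for example, the second byte of a character can be 0x5C, the ASCII backslash. Here is a small standalone C++ sketch of the difference, using the character U+8868 in both encodings. The helper name contains_backslash_byte is made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): does the byte string contain
// an ASCII backslash byte (0x5C) anywhere?
bool contains_backslash_byte(const std::string& bytes) {
    return bytes.find('\\') != std::string::npos;
}

// The character U+8868 in two encodings.
const std::string utf8_bytes = "\xE8\xA1\xA8";  // UTF-8: every byte >= 0x80
const std::string sjis_bytes = "\x95\x5C";      // Shift-JIS: trail byte is 0x5C

// Quick self-check of the two claims above.
void check_encodings() {
    assert(!contains_backslash_byte(utf8_bytes));  // safe to scan bytewise
    assert(contains_backslash_byte(sjis_bytes));   // a bytewise scan misfires
}
```

So a bytewise escaper would wrongly insert a backslash inside the Shift-JIS bytes, while the UTF-8 bytes pass through untouched. Converting input to UTF-8 first (in R, for instance with enc2utf8) would sidestep the problem, assuming the strings carry a correct encoding mark.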
You can also try stri_escape_unicode from the stringi package (you preferred a solution without additional dependencies, but I think it could be useful for future readers too), which is about 3 times faster than deparse_vector2 and about 7 times faster than deparse_vector.
require(stringi)
Defining the function
deparse_vector3 <- function(x){
  paste0("\"", stri_escape_unicode(x), "\"")
}
Checking that all functions give the same result
all.equal(deparse_vector2(test), deparse_vector3(test))
## [1] TRUE
all.equal(deparse_vector(test), deparse_vector3(test))
## [1] TRUE
Some benchmarks
library(microbenchmark)
microbenchmark(deparse_vector(test),
               deparse_vector2(test),
               deparse_vector3(test), times = 1000L)
# Unit: microseconds
#                   expr    min      lq  median      uq      max neval
#   deparse_vector(test) 98.548 102.654 104.707 111.380 2500.653  1000
#  deparse_vector2(test) 43.114  46.707  48.761  51.327  401.377  1000
#  deparse_vector3(test) 14.885  16.938  18.991  20.018  240.211  1000  <-- Clear winner