This SO post has an example of a server that generates json with a byte order mark. RFC7159 says:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Currently yajl and hence jsonlite choke on the BOM. I would like to follow the RFC suggestion and ignore the BOM from the UTF8 string if present. What is an efficient way to do this? A naive implementation:
if(substr(json, 1, 1) == "\uFEFF"){
json <- substring(json, 2)
}
However substr is a bit slow for large strings, and I am not sure this is the correct way to do this. Is there a more efficient way in R or C to remove the BOM if present?
A simple solution:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
std::string stripBom(std::string x) {
if (x.size() < 3)
return x;
if (x[0] == '\xEF' && x[1] == '\xBB' && x[2] == '\xBF')
return x.substr(3);
return x;
}
/*** R
x <- "\uFEFFabcdef"
print(x)
print(stripBom(x))
identical(x, stripBom(x))
utf8ToInt(x)
utf8ToInt(stripBom(x))
*/
gives
> x <- "\uFEFFabcdef"
> print(x)
[1] "abcdef"
> print(stripBom(x))
[1] "abcdef"
> identical(x, stripBom(x))
[1] FALSE
> utf8ToInt(x)
[1] 65279 97 98 99 100 101 102
> utf8ToInt(stripBom(x))
[1] 97 98 99 100 101 102
EDIT: What might also be useful is seeing how R does it internally -- there are a number of situations where R strips BOM (e.g. for its scanners and file readers). See:
https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/scan.c#L455-L458
https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/connections.c#L3950-L3957
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With