 

Can't convert unicode symbols to cyrillic

I have a bunch of documents persisted in Apache Lucene with some names in Russian, and when I try to print them out they look like "\u0410\u0441\u043f\u0430\u0440" instead of Cyrillic characters. The project is in Scala. I've tried to fix this with the Apache Commons unescapeJava method, but it didn't help. Are there any other options?

Updated: The project is written with the Spray framework and returns JSON like this:

{
  "id" : 0,
  "name" : "\u0410\u0441\u043f\u0430\u0440"
}
4lex1v asked Feb 18 '23 13:02



1 Answer

I'm going to try to infer exactly what you are doing. You are using Spray, so I gather that you are using its JSON library, spray-json.

So I suppose that you have some instance of spray.json.JsObject, and that what you posted in your question is the output you get when printing this instance. Your JSON object is correct: the value of the name field has no embedded escaping. It is the conversion to a string that escapes some Unicode characters.

See the definition of printString here: https://github.com/spray/spray-json/blob/master/src/main/scala/spray/json/JsonPrinter.scala

I will also assume that when you tried unescapeJava, you applied it to the value of the name field, creating a new spray.json.JsObject instance that you then printed as before. Given that your JSON object does not actually contain any escaping, this did absolutely nothing. When you then printed it, the printer did the escaping as before, and you were back to square one.

As a side note, it's worth mentioning that the JSON spec does not mandate how characters are encoded: they can be stored either as their literal value or as a Unicode escape. For example, the string "abc" could be written as just "abc", or as "\u0061\u0062\u0063". Either form is correct. It just happens that the author of spray-json decided to use the latter form for all non-ASCII characters.
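
To make that equivalence concrete, here is a minimal sketch in plain Scala (standard library only) that decodes \uXXXX escapes and shows that both spellings denote the same string. The object and method names are made up for illustration and are not part of spray-json. It handles only \uXXXX escapes, not the other JSON escapes.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration; not part of spray-json or any JSON library.
object UnicodeEscapeDemo {
  // Matches a backslash, the letter 'u', and exactly four hex digits.
  private val Escape: Regex = """\\u([0-9a-fA-F]{4})""".r

  // Replaces every \uXXXX sequence with the character it denotes.
  // quoteReplacement keeps '$' and '\' in the decoded text from being
  // interpreted as regex replacement syntax.
  def decode(s: String): String =
    Escape.replaceAllIn(s, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
}
```

With this, decode applied to the escaped form of "abc" gives back "abc", and the escaped name from the question decodes to the Cyrillic text, so the two forms really are interchangeable as far as a JSON parser is concerned.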

So now you ask, what can I do to work around this? You could ask the spray-json author to add an option that lets you specify that you don't want any Unicode escaping. But I imagine that you want a solution right now.

The simplest thing to do is to convert your object to a string (via JsValue.toString, JsValue.compactPrint or JsValue.prettyPrint), and then pass the result to unescapeJava. At least this will give you back your original Cyrillic characters. But this is a bit gross, and actually quite dangerous, as some characters are not safe to unescape inside a string literal. For example, \n will be unescaped to an actual newline, and \u0022 will be unescaped to ". You can easily see how this will break your JSON document. But at the very least it will let you confirm my theory (remember that I have been making assumptions about what exactly you are doing).
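
The hazard can be shown without any JSON library. The sketch below uses a hand-rolled stand-in for unescapeJava that decodes only \uXXXX escapes (an assumption for illustration; the real Apache Commons method handles more escapes than this), and applies it to a whole document whose string value contains an escaped double quote. All names here are hypothetical.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for unescapeJava: decodes only \uXXXX escapes.
object NaiveUnescapeDemo {
  private val Escape: Regex = """\\u([0-9a-fA-F]{4})""".r

  def naiveUnescape(s: String): String =
    Escape.replaceAllIn(s, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))

  // A valid JSON document: the value of "name" is the three characters a, ", b,
  // with the double quote written as the escape \u0022.
  val validJson: String = "{\"name\":\"a\\u0022b\"}"

  // Unescaping the whole document turns the escape into a bare double quote,
  // which terminates the string early and leaves the document malformed.
  val broken: String = naiveUnescape(validJson)
}
```

Here validJson contains four double quotes, all of them delimiters, while broken contains five, so a parser would see the string end after the a and then choke on what follows.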

Now for a proper fix: you could simply extend JsonPrinter and override its printString method to remove the Unicode escaping. Something like this (untested):

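import scala.annotation.tailrec
// Assumption: JsonPrinter.printString takes a java.lang.StringBuilder;
// check the signature in your spray-json version.
import java.lang.StringBuilder
import spray.json._

trait NoUnicodeEscJsonPrinter extends JsonPrinter {
  override protected def printString(s: String, sb: StringBuilder): Unit = {
    @tailrec
    def printEscaped(s: String, ix: Int): Unit = {
      if (ix < s.length) {
        s.charAt(ix) match {
          case '"' => sb.append("\\\"")
          case '\\' => sb.append("\\\\")
          case x if 0x20 <= x && x < 0x7F => sb.append(x)
          case '\b' => sb.append("\\b")
          case '\f' => sb.append("\\f")
          case '\n' => sb.append("\\n")
          case '\r' => sb.append("\\r")
          case '\t' => sb.append("\\t")
          // Other control characters must stay escaped to keep the JSON valid.
          case x if x < 0x20 => sb.append("\\u%04x".format(x.toInt))
          // Everything else, including Cyrillic, is written as-is.
          case x => sb.append(x)
        }
        printEscaped(s, ix + 1)
      }
    }
    sb.append('"')
    printEscaped(s, 0)
    sb.append('"')
  }
}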

trait NoUnicodeEscPrettyPrinter  extends PrettyPrinter with NoUnicodeEscJsonPrinter
object NoUnicodeEscPrettyPrinter extends NoUnicodeEscPrettyPrinter

trait NoUnicodeEscCompactPrinter   extends CompactPrinter  with NoUnicodeEscJsonPrinter
object NoUnicodeEscCompactPrinter  extends NoUnicodeEscCompactPrinter

Then you can do:

val json: JsValue = ...
val jsonString: String = NoUnicodeEscPrettyPrinter( json )

jsonString will contain your JSON document in pretty-print format and without any Unicode escaping.

Régis Jean-Gilles answered Feb 20 '23 08:02
