
What are these symbols that crash URLDecoder with UTF-8?

I'm using URLDecoder to decode a string:

import java.net.URLDecoder
import java.nio.charset.StandardCharsets
URLDecoder.decode("%u6EDA%u52A8%u8F74%u627F", StandardCharsets.UTF_8.name())

This crashes with the following exception:

Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u6"
    at java.net.URLDecoder.decode(URLDecoder.java:194)
    at Playground$.delayedEndpoint$Playground$1(Playground.scala:45)
    at Playground$delayedInit$body.apply(Playground.scala:10)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at Playground$.main(Playground.scala:10)
    at Playground.main(Playground.scala)

It seems that sequences like %u6 and %u8 are not allowed in the string. I've tried to read up on what these symbols are, without success. I found the string in a dataset, in a field called "page title field", so I suspect they are encoded characters; I just don't know which encoding. Does anyone know what these symbols are and which encoding I should use to decode them successfully?

Sahand asked Dec 28 '25 08:12


1 Answer

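This looks like a non-standard, UTF-16-based encoding of "滚动轴承", which is Chinese for "rolling bearings". Each %uXXXX holds one UTF-16 code unit in hexadecimal, the same format that JavaScript's legacy escape() function produces.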

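I'd suggest replacing each %u with \u and then using StringEscapeUtils from Apache Commons Lang (deprecated there since 3.6 in favor of the same class in Apache Commons Text):

import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import org.apache.commons.lang3.StringEscapeUtils
// replace (unlike replaceAll) is literal, so "\\u" stays a backslash followed by u
val unescapedJava = StringEscapeUtils.unescapeJava(str.replace("%u", "\\u"))
URLDecoder.decode(unescapedJava, StandardCharsets.UTF_8.name())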

This should handle both kinds of escaping:

  • The normal escape sequences (% followed by two hex digits) are unaffected by the replacement and by unescapeJava; URLDecoder decodes them in the last step.
  • The non-standard %u sequences are rewritten to \u and decoded by unescapeJava, so none are left by the time URLDecoder runs.
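If pulling in Apache Commons is not an option, the same two-kinds-of-escape idea can be sketched with the standard library only. This is a minimal sketch, not a vetted implementation: the object name PercentUDecoder and its decode method are made up for illustration, and it assumes every %u escape has exactly four hex digits.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import scala.util.matching.Regex

// Hypothetical helper, not part of any library.
object PercentUDecoder {
  // One non-standard escape: %u followed by four hex digits (one UTF-16 code unit).
  private val PercentU: Regex = "%u([0-9A-Fa-f]{4})".r

  // Decodes %uXXXX escapes directly and passes the text between them
  // to URLDecoder, so ordinary %XX escapes keep working.
  def decode(input: String): String = {
    val utf8 = StandardCharsets.UTF_8.name()
    val out = new StringBuilder
    var last = 0
    for (m <- PercentU.findAllMatchIn(input)) {
      out ++= URLDecoder.decode(input.substring(last, m.start), utf8)
      out += Integer.parseInt(m.group(1), 16).toChar
      last = m.end
    }
    out ++= URLDecoder.decode(input.substring(last), utf8)
    out.toString
  }

  def main(args: Array[String]): Unit = {
    println(decode("%u6EDA%u52A8%u8F74%u627F"))
    println(decode("caf%C3%A9%u6EDAcafebabe"))
  }
}
```

A malformed escape such as %uXYZ does not match the pattern, so it reaches URLDecoder and still raises IllegalArgumentException, which seems preferable to guessing. Note that URLDecoder also turns + into a space in the parts it handles.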

If (and only if) you are absolutely certain that every character was encoded this way, you can do without StringEscapeUtils:

new String(
  "%u6EDA%u52A8%u8F74%u627F"
  .replaceAll("%u", "")
  .grouped(4)
  .map(Integer.parseInt(_, 16).toChar)
  .toArray
)

which produces

res: String = 滚动轴承

but I'd advise against it, because this method breaks on inputs like "%u6EDA%u52A8%u8F74%u627Fcafebabe" that contain unescaped characters: the trailing "cafebabe" is silently parsed as two more hex code units instead of being kept as text. Better to use a reliable library method that handles all corner cases.
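To make the failure mode concrete, here is a small sketch that wraps the shortcut in a function and feeds it mixed input. The names GroupedPitfall and naiveDecode are made up for illustration.

```scala
import scala.util.Try

// Hypothetical wrapper around the shortcut, for demonstration only.
object GroupedPitfall {
  // Strip every "%u", then read the rest four hex digits at a time.
  def naiveDecode(s: String): String =
    new String(
      s.replaceAll("%u", "")
        .grouped(4)
        .map(Integer.parseInt(_, 16).toChar)
        .toArray
    )

  def main(args: Array[String]): Unit = {
    // Fully escaped input decodes as intended.
    println(naiveDecode("%u6EDA%u52A8"))
    // "cafe" happens to be valid hex, so it silently becomes the single char U+CAFE.
    println(naiveDecode("%u6EDAcafe") == "滚\uCAFE")
    // "page" is not valid hex, so parseInt throws a NumberFormatException.
    println(Try(naiveDecode("%u6EDApage")).isFailure)
  }
}
```

The second case is the more dangerous one, since nothing fails and the wrong text simply flows onward.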

Andrey Tyukin answered Dec 30 '25 23:12


