For some reason a piece of code replaces spaces with \u00A0, i.e. a non-breaking space. This code is then used to sanitize a URL (yes, I know that is very bad, in many ways). Strangely, when these are displayed in my test JSP a rogue Â appears. Why?
Sample JSP to demonstrate the issue.
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Page</title>
        <%
            String[] parameters = request.getParameterValues("p");
            if (parameters == null || parameters.length == 0) {
                parameters = new String[]{""};
            }
        %>
    </head>
    <body>
        <h1>Hello World!</h1>
        <a href='index.jsp?p=<%="Hello\u00A0there"%>'>A Link</a>
        <p><%=parameters[0]%></p>
    </body>
</html>
Why is the parameter showing as HelloÂ there? Where is the c2 coming from?
Added
BTW: The hex of the parameter is 48 65 6c 6c 6f c2 a0 74 68 65 72 65, showing the c2 in situ.
A rogue Â appearing is most often an indication that something got encoded using UTF-8, and then decoded back again using a "traditional" code-page character set, e.g. ISO-8859-1, or CP850, or ...
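A minimal sketch of that round trip in plain Java, assuming the wrong decoder is ISO-8859-1 (the class and method names here are made up for illustration, and this is not necessarily the exact path the JSP container takes):

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String garble(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the raw bytes and decode them as UTF-8.
    static String repair(String s) {
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Hello\u00A0there";
        String garbled = garble(original);

        // The two UTF-8 bytes C2 A0 each became one character: U+00C2 (Â) and U+00A0.
        System.out.println(garbled.length());                             // 12
        System.out.println(garbled.equals("Hello\u00C2\u00A0there"));     // true
        System.out.println(repair(garbled).equals(original));             // true
    }
}
```

The extra character is U+00C2, which ISO-8859-1 renders as Â. The non-breaking space itself is still there, right after it.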
To answer the actual question "Where is Â (C2) coming from?", you may find this article helpful:
Non-breaking space, 0x00A0 in UTF-16, is encoded as 0xC2A0 in UTF-8.
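This is easy to check from Java. The sketch below (the helper name `hex` is illustrative only) dumps the UTF-8 bytes of a string, and for the string from the question it reproduces the hex posted above:

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    // Return the UTF-8 bytes of s as lowercase hex pairs separated by spaces.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values format as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00A0"));            // c2 a0
        System.out.println(hex("Hello\u00A0there"));  // 48 65 6c 6c 6f c2 a0 74 68 65 72 65
    }
}
```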
This table may help as well
Examples of encoded Unicode characters (in hexadecimal notation)

16-bit Unicode    UTF-8 Sequence
0001              01
007F              7F
0080              C2 80          <-- two-byte range starts here; nbsp (00A0 -> C2 A0) falls in it
07FF              DF BF
0800              E0 A0 80
FFFF              EF BF BF
010000            F0 90 80 80
10FFFF            F4 8F BF BF
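The boundary values in the table follow from the UTF-8 bit layout. As a rough sketch (a hand-rolled encoder for illustration only; real code should rely on `String.getBytes` with a charset, and this version does not reject surrogates or out-of-range values):

```java
class Utf8Sketch {
    // Encode a single code point by hand and return its bytes as uppercase hex.
    static String encode(int cp) {
        int[] bytes;
        if (cp < 0x80) {
            bytes = new int[]{cp};                                  // 0xxxxxxx
        } else if (cp < 0x800) {
            bytes = new int[]{0xC0 | (cp >> 6),                     // 110xxxxx
                              0x80 | (cp & 0x3F)};                  // 10xxxxxx
        } else if (cp < 0x10000) {
            bytes = new int[]{0xE0 | (cp >> 12),                    // 1110xxxx
                              0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)};
        } else {
            bytes = new int[]{0xF0 | (cp >> 18),                    // 11110xxx
                              0x80 | ((cp >> 12) & 0x3F),
                              0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)};
        }
        StringBuilder sb = new StringBuilder();
        for (int b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(0x00A0));    // C2 A0
        System.out.println(encode(0x0800));    // E0 A0 80
        System.out.println(encode(0x10FFFF));  // F4 8F BF BF
    }
}
```

For U+00A0 the top bits are 00010, giving 110 00010 = C2, and the low six bits are 100000, giving 10 100000 = A0. That is where the C2 comes from.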