Where is the Â (C2) coming from?

Tags: java, url, jsp

For some reason, a piece of code replaces spaces with \u00A0 - i.e. a non-breaking space. This code is then used to sanitize a URL (yes, I know that is very bad - in many ways). Strangely, when these are displayed in my test JSP, a rogue Â appears - why?

Sample JSP to demonstrate the issue.

<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>JSP Page</title>
    <%
      String[] parameters = request.getParameterValues("p");
      if (parameters == null || parameters.length == 0) {
        parameters = new String[]{""};
      }
    %>
  </head>
  <body>
    <h1>Hello World!</h1>
    <a href='index.jsp?p=<%="Hello\u00A0there"%>'>A Link</a>
    <p><%=parameters[0]%></p>
  </body>
</html>

Why is the parameter showing as HelloÂ there? Where is the C2 coming from?

Added

BTW: The hex of the parameter is 48 65 6c 6c 6f c2 a0 74 68 65 72 65 showing the c2 in-situ.

OldCurmudgeon asked Feb 25 '16 09:02



2 Answers

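A rogue Â appearing is most often an indication that something got encoded using UTF-8 and then decoded again using a "traditional" single-byte code-page character set, e.g. ISO-8859-1 or Windows-1252. (Other code pages, such as CP850, map the byte C2 to a different character, so the stray symbol would look different there.)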
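As a minimal sketch of that round trip, the snippet below encodes the string from the question as UTF-8 and then decodes the same bytes as ISO-8859-1. The class name `MojibakeDemo` and the helper `misdecode` are made up for illustration, and ISO-8859-1 is only an assumption about which decoder is involved; the question does not show what the container actually uses.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: encode as UTF-8, then (wrongly) decode as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Hello\u00A0there";   // contains one non-breaking space
        String garbled = misdecode(original);

        System.out.println(original.length());          // 11
        System.out.println(garbled.length());           // 12: one extra character
        System.out.println(garbled.indexOf('\u00C2'));  // 5: the stray character sits right after "Hello"
    }
}
```

The two UTF-8 bytes C2 A0 are read back as two separate single-byte characters, U+00C2 (Â) followed by U+00A0, which is why the non-breaking space is still there after the extra character.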

Erwin Smout answered Oct 22 '22 23:10



To answer the actual question, "Where is the Â (C2) coming from?", you may find this article helpful:
The non-breaking space, U+00A0 (0x00A0 in UTF-16), is encoded as the two bytes 0xC2 0xA0 in UTF-8.

This table may help as well

Examples of encoded Unicode characters (in hexadecimal notation)

16-bit Unicode    UTF-8 Sequence
0001              01
007F              7F
0080              C2 80   <-- start of the two-byte range; nbsp (00A0) falls here and encodes as C2 A0
07FF              DF BF
0800              E0 A0 80
FFFF              EF BF BF
010000            F0 90 80 80
10FFFF            F4 8F BF BF
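
To check rows of this table yourself, here is a small sketch that prints the UTF-8 bytes of a string in hex. The class name `Utf8Hex` and the method `utf8Hex` are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {

    // Hypothetical helper: space-separated lowercase hex of the UTF-8 bytes.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes >= 0x80 are not printed as negative ints.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A0"));            // c2 a0
        System.out.println(utf8Hex("\u0080"));            // c2 80
        System.out.println(utf8Hex("\u0800"));            // e0 a0 80
        System.out.println(utf8Hex("Hello\u00A0there"));  // 48 65 6c 6c 6f c2 a0 74 68 65 72 65
    }
}
```

The last line matches the hex dump given in the question, which suggests the parameter bytes themselves are valid UTF-8 and only the decoding step is off.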
radoh answered Oct 23 '22 00:10
