Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JSP not showing correct UTF-8 contents for HTML form POST

I'm using Java 11 with Tomcat 9 with the latest JSP/JSTL. I'm testing in Chrome 71 and Firefox 64.0 on Windows 10. I have the following test document:

<%@ page contentType="text/html; charset=UTF-8" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<!DOCTYPE html>
<html lang="en-US">
<head>
  <meta charset="UTF-8"/>
  <title>Hello</title>
</head>
<body>
  <c:if test="${not empty param.fullName}">
    <p>Hello, ${param.fullName}.</p>
  </c:if>

  <form>
    <div>
      <label>Full name: <input name="fullName" /></label>
    </div>
    <button>Say Hello</button>
  </form>
</body>
</html>

This is perhaps the simplest form possible. As you know the form method defaults to get, the form action defaults to "" (submitting to the same page), and the form enctype defaults to application/x-www-form-urlencoded.

If I enter the name "Flávio José" (a famous Brazilian forró singer and musícian) in the field and submit, the form is submitted via HTTP GET to the same page using hello.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. This is correct, and the page says:

Hello, Flávio José.

If I change the form method to post and enter the same name "Flávio José", the form contents are instead submitted via POST, with HTTP request contents:

fullName=Fl%C3%A1vio+Jos%C3%A9

This also appears correct. But this time the page says:

Hello, Flávio José.

Rather than seeing %C3%A as a sequence of UTF-8 octects, JSP seems to think that these are a series of ISO-8859-1 octets (or code page 1252 octets), and is therefore decoding them to the wrong character sequence.

But where is it getting ISO-8859-1? What is my JSP page lacking to indicate the correct encoding?

I'll note also that WHATWG specification says that application/x-www-form-urlencoded octets should be parsed as UTF-8 by default. Is the Java servlet specification simply broken? How do I work around this?

like image 832
Garret Wilson Avatar asked Jan 08 '19 15:01

Garret Wilson


1 Answers

This is caused by Tomcat, but the root problem is the Java Servlet 4 specification, which is incorrect and outdated.

Originally HTML 4.0.1 said that application/x-www-form-urlencoded encoded octets should be decoded as US-ASCII. The servlet specification changed this to say that, if the request encoding is not specified, the octets should be decoded as ISO-8859-1. Tomcat is simply following the servlet specification.

There are two problems with the Java servlet specification. The first is that the modern interpretation of application/x-www-form-urlencoded is that encoded octets should be decoded using UTF-8. The second problem is that tying the octet decoding to the resource charset confuses two levels of decoding.

Take another look at this POST content:

fullName=Fl%C3%A1vio+Jos%C3%A9

You'll notice that it is ASCII!! It doesn't matter if you consider the POST HTTP request charset to be ISO-8859-1, UTF-8, or US-ASCII—you'll still wind up with exactly the same Unicode characters before decoding the octets! What encoding is used to decode the encoding octets is completely separate.

As a further example, let's say I download a text file instructions.txt that is clearly marked as ISO-8859-1, and it contains the URI https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. Just because the text file has a charset of ISO-8859-1, does that mean I need to decode %C3%A using ISO-8859-1? Of course not! The charset used for decoding URI characters is a separate level of decoding on top of the resource content type charset! Similarly the octets of values encoded in application/x-www-form-urlencoded should be decoded using UTF-8, regardless of the underlying charset of the resource.

There are several workarounds, some of them found at found by looking at the Tomcat character encoding FAQ to "use UTF-8 everywhere".

Set the request character encoding in your web.xml file.

Add the following to your WEB-INF/web.xml file:

<request-character-encoding>UTF-8</request-character-encoding>

This setting is agnostic of the servlet container implementation, and is defined forth in the servlet specification. (You should be able to alternatively put it in Tomcat's conf/web.xml file, if want a global setting and don't mind changing the Tomcat configuration.)

Set the SetCharacterEncodingFilter in your web.xml file.

Tomcat has a proprietary equivalent: use the org.apache.catalina.filters.SetCharacterEncodingFilter in the WEB-INF/web.xml file, as the Tomcat FAQ above mentions, and as illustrated by https://stackoverflow.com/a/37833977/421049, excerpted below:

<filter>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>

<filter-mapping>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>

This will make your web application only work on Tomcat, so it's better to put this in the Tomcat installation conf/web.xml file instead, as the post above mentions. In fact Tomcat's conf/web.xml installations have these two sections, but commented out; simply uncomment them and things should work.

Force the request character encoding to UTF-8 in the JSP or servlet.

You can force the character encoding of the servlet request to UTF-8, somewhere early in the JSP:

<% request.setCharacterEncoding("UTF-8"); %>

But that is ugly, unwieldy, error-prone, and goes against modern best practices—JSP scriptlets shouldn't be used anymore.

Hopefully we can get a newer Java servlet specification to remove any relationship between the resource charset and the decoding of application/x-www-form-urlencoded octets, and simply state that application/x-www-form-urlencoded octets must be decoded as UTF-8, as is modern practice as clarified by the latest W3C and WHATWG specifications.

Update: I've updated the Tomcat FAQ on Character Encoding Issues with this information.

like image 152
Garret Wilson Avatar answered Nov 07 '22 20:11

Garret Wilson