Unicode issue with an HTML title: question mark (&#65533;)

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the Apache Commons Lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my web page with UTF-8 encoding, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

Any ideas? Thanks.

James asked Aug 19 '10 23:08


1 Answer

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
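To see this happen in isolation, here is a minimal sketch that encodes a fragment of the title as ISO-8859-1 bytes and then decodes those bytes as UTF-8. The class and method names are made up for illustration, and it assumes the JDK's default behaviour of replacing malformed input when constructing a String from bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question's code.
class ReplacementCharDemo {

    // Encodes the text as ISO-8859-1, then decodes those bytes with the
    // wrong decoder (UTF-8), mimicking what happened to the fetched page.
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e9" is the letter e with an acute accent; the escape keeps
        // this source file independent of the compiler's source encoding.
        String original = "Caf\u00e9: Rock";
        String garbled = decodeLatin1BytesAsUtf8(original);

        // The lone byte 0xE9 is not a valid UTF-8 sequence, so the decoder
        // substitutes U+FFFD (decimal 65533) for it.
        System.out.println(garbled.indexOf('\uFFFD') >= 0);
        System.out.println((int) garbled.charAt(3));
    }
}
```

Any escaping applied after this point only ever sees U+FFFD, which is why escapeHtml emits a numeric reference for 65533 rather than an entity for the accented letter.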

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
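A minimal sketch of that fix, assuming you already have the raw response body as a byte array (how you fetch it is up to your HTTP client). The class name, method name, and sample bytes below are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are assumptions, not from the question.
class PageDecoder {

    // Converts the fetched bytes to a String with an explicitly chosen
    // charset instead of relying on a default or a hard-coded UTF-8.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes the server would send for the title,
        // encoded as ISO-8859-1 like the page in question.
        byte[] body = "<title>Das hermetische Caf\u00e9: Rock &amp; Wrestling 2010</title>"
                .getBytes(StandardCharsets.ISO_8859_1);

        String html = decode(body, StandardCharsets.ISO_8859_1);

        // With the matching decoder the accented letter survives intact.
        System.out.println(html.contains("Caf\u00e9"));
    }
}
```

If you read from a stream rather than a byte array, the same idea applies: wrap it in new InputStreamReader(in, StandardCharsets.ISO_8859_1). In general the charset should come from the response's Content-Type header or the page's meta tag rather than being hard-coded.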

erickson answered Sep 29 '22 23:09