Why is /🌈 an invalid path when /a🌈 is valid?

Question

I’m trying to understand why certain HTML attributes are failing W3C validation. I encountered this in a real codebase, but here’s a minimal reproduction:

<!DOCTYPE html><html lang="en"><head><title>a</title></head><body>

<img alt="1" src="⭐">
<img alt="2" src="/⭐">
<img alt="3" src="/a⭐">
<img alt="4" src="/a/⭐">
<img alt="5" src="🌈">
<img alt="6" src="/🌈"> <!-- Only this is invalid. -->
<img alt="7" src="/a🌈">
<img alt="8" src="/a/🌈">

</body></html>

The W3C validator reports only one error, affecting the sixth image:

Error: Bad value /🌈 for attribute src on element img: Illegal character in path segment: ? is not allowed.
<img alt="6" src="/🌈">

Why is only that one a problem, and not the others? What’s different about it?

sideshowbarker · Accepted Answer

The behavior described in the question was caused by a bug in the checker (validator) code, which is now fixed; see https://github.com/validator/galimatias/pull/2. The bug went unnoticed because the test suite had no coverage for a relative URL that starts with a slash followed by a code point greater than U+FFFF, such as the U+1F308 🌈 (rainbow) character in the question. The test suite was therefore updated to cover that case; see https://github.com/web-platform-tests/wpt/pull/36213.

Incidentally, here is why the U+2B50 (⭐) case wasn’t affected by the bug while the U+1F308 (🌈) case was: Java strings use UTF-16, and U+1F308 is a so-called supplementary character (that is, a code point above U+FFFF). In UTF-16, U+1F308 is therefore represented by a surrogate pair of two char values, while U+2B50 is represented by a single char value.
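The difference between the two characters can be checked directly in Java. The following is a small illustrative sketch, not code from the checker; the class name Utf16Demo and its helper methods are made up for this example.

```java
// Illustrative only: Utf16Demo and its helpers are hypothetical names,
// not part of the HTML checker.
final class Utf16Demo {
    // Number of UTF-16 char values (code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String star = new String(Character.toChars(0x2B50));     // U+2B50
        String rainbow = new String(Character.toChars(0x1F308)); // U+1F308

        System.out.println(codeUnits(star));     // 1
        System.out.println(codeUnits(rainbow));  // 2
        System.out.println(codePoints(rainbow)); // 1

        // The rainbow is stored as a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(rainbow.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(rainbow.charAt(1)));  // true
    }
}
```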

The number of char values matters for URL parsing because the state machine in the HTML checker’s URL-parsing code maintains a character index and decrements it during state changes. So when it handles a URL segment that can contain code points above U+FFFF, it has to decrement the index by the right number of char values: 2 for code points above U+FFFF, and 1 otherwise.
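To see what goes wrong when an index is stepped back by a fixed 1, here is a minimal sketch. It does not reproduce the checker’s state machine; StepBackDemo, naiveStepBack, and awareStepBack are hypothetical names, and the sketch only shows where each kind of decrement lands in the string "/🌈".

```java
// Illustrative only: not the checker's parser, just the index arithmetic.
final class StepBackDemo {
    // Naive step back: always one char value, like a plain idx--.
    static int naiveStepBack(int idx) {
        return idx - 1;
    }

    // Code-point-aware step back: one or two char values, depending on
    // the code point that ends just before idx.
    static int awareStepBack(String s, int idx) {
        return idx - Character.charCount(s.codePointBefore(idx));
    }

    public static void main(String[] args) {
        String path = "/\uD83C\uDF08"; // "/" followed by U+1F308
        int end = path.length();       // 3 char values

        int naive = naiveStepBack(end);       // 2: the low surrogate
        int aware = awareStepBack(path, end); // 1: start of the pair

        // Reading at the naive index yields a lone surrogate, not the emoji.
        System.out.println(Integer.toHexString(path.codePointAt(naive))); // df08
        System.out.println(Integer.toHexString(path.codePointAt(aware))); // 1f308
    }
}
```

For a string such as "/⭐", both methods return the same index, which matches the observation that the U+2B50 cases were unaffected.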

To do that, the code has a decrIdx() method that calls Character.charCount(), which the Java documentation describes as follows:

“Determines the number of char values needed to represent the specified character (Unicode code point). If the specified character is equal to or greater than 0x10000, then the method returns 2. Otherwise, the method returns 1.”

So the fix to the checker replaced a plain idx-- decrement of the index with a call to decrIdx(), which uses Character.charCount() to step back by the right number of char values.
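As a rough sketch of the idea, assuming a simplified cursor rather than the checker’s real parser state, a decrIdx()-style method could look like the following. The CodePointCursor class, its fields, and incrIdx() are hypothetical; only the method name decrIdx() and the use of Character.charCount() come from the description above.

```java
// A simplified, hypothetical cursor; the real parser keeps more state.
final class CodePointCursor {
    private final String input;
    private int idx = 0; // index in char values, kept on a code point boundary

    CodePointCursor(String input) {
        this.input = input;
    }

    int index() {
        return idx;
    }

    // Code point at the cursor, or -1 at the end of input.
    int current() {
        return idx < input.length() ? input.codePointAt(idx) : -1;
    }

    // Advance by one code point (1 or 2 char values).
    void incrIdx() {
        if (idx < input.length()) {
            idx += Character.charCount(input.codePointAt(idx));
        }
    }

    // Step back by one code point (1 or 2 char values) instead of idx--.
    void decrIdx() {
        if (idx > 0) {
            idx -= Character.charCount(input.codePointBefore(idx));
        }
    }

    public static void main(String[] args) {
        CodePointCursor c = new CodePointCursor("/\uD83C\uDF08x");
        c.incrIdx(); // past "/"
        c.incrIdx(); // past the rainbow (two char values)
        System.out.println(c.index()); // 3
        c.decrIdx(); // back over the whole surrogate pair
        System.out.println(c.index()); // 1
        System.out.println(Integer.toHexString(c.current())); // 1f308
    }
}
```

Keeping the index on code point boundaries in both directions means the cursor never reads half of a surrogate pair.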

Tags:

html

w3c-validation

emoji

url-parsing

ændrük
