
Why is /🌈 an invalid path when /a🌈 is valid?

I’m trying to understand why certain HTML attributes are failing W3C validation. I encountered this in a real codebase, but here’s a minimal reproduction:

<!DOCTYPE html><html lang="en"><head><title>a</title></head><body>

<img alt="1" src="⭐">
<img alt="2" src="/⭐">
<img alt="3" src="/a⭐">
<img alt="4" src="/a/⭐">
<img alt="5" src="🌈">
<img alt="6" src="/🌈"> <!-- Only this is invalid. -->
<img alt="7" src="/a🌈">
<img alt="8" src="/a/🌈">

</body></html>

The W3C validator reports only one error, affecting the sixth image:

  1. Error: Bad value /🌈 for attribute src on element img: Illegal character in path segment: ? is not allowed.

    <img alt="6" src="/🌈">
    

Why is only that one a problem, and not the others? What’s different about it?

ændrük asked Feb 12 '26 22:02



1 Answer

The behavior described in the question was caused by a bug in the checker (validator) code, which has since been fixed; see https://github.com/validator/galimatias/pull/2. The bug went unnoticed because the test suite had no coverage for a relative URL that starts with a slash followed by a code point greater than U+FFFF, like the U+1F308 🌈 (rainbow) character in the question. The test suite was therefore also updated to cover that case; see https://github.com/web-platform-tests/wpt/pull/36213.


Incidentally, the U+2B50 (⭐) case wasn’t affected by the bug while the U+1F308 (🌈) case was because Java uses UTF-16, and U+1F308 is in the range of so-called supplementary characters (that is, the set of code points above U+FFFF). So, as noted in a comment above, in UTF-16 the code point U+1F308 is represented by a surrogate pair of two char values, while U+2B50 is represented by a single char value.
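To make the difference concrete, here is a small standalone sketch (not taken from the checker's code; the class and helper names are made up for illustration) that asks Java how many char values each of the two code points from the question needs:

```java
public class SurrogateDemo {
    // Hypothetical helper: number of UTF-16 char values in the
    // String form of a single code point.
    static int utf16Length(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        int star = 0x2B50;     // U+2B50, inside the Basic Multilingual Plane
        int rainbow = 0x1F308; // U+1F308, a supplementary character

        System.out.println(utf16Length(star));    // 1
        System.out.println(utf16Length(rainbow)); // 2
        System.out.println(Character.isSupplementaryCodePoint(rainbow)); // true

        // The path from the question: "/" followed by U+1F308.
        String path = "/" + new String(Character.toChars(rainbow));
        System.out.println(path.length());                         // 3 char values
        System.out.println(path.codePointCount(0, path.length())); // 2 code points
    }
}
```

The mismatch between `length()` and `codePointCount()` on the last two lines is the kind of gap that an index counted in char values has to account for.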

The number of char values matters to URL parsing because the state machine in the HTML checker’s URL-parsing code maintains a character index and decrements it during state changes. So when it handles a URL segment that can contain code points above U+FFFF, it must decrement the index by the right number of char values: by 2 for code points above U+FFFF, and by 1 otherwise.

And to do that, the code has a decrIdx() method that calls Character.charCount():

Determines the number of char values needed to represent the specified character (Unicode code point). If the specified character is equal to or greater than 0x10000, then the method returns 2. Otherwise, the method returns 1.

So the fix to the checker replaced a plain idx-- decrement of the index with a call to decrIdx(), which uses Character.charCount() to step back by the right number of char values.
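As a rough illustration of why the plain decrement goes wrong, the sketch below steps an index back after reading the code point that follows the slash. It is a simplified stand-in, not the checker's actual parser: the class name, the `naiveStepBack` helper, and the signature given to `decrIdx` here are assumptions made for the example.

```java
public class DecrIdxSketch {
    // Naive step back: always moves by one char value (the old idx-- behavior).
    static int naiveStepBack(int idx) {
        return idx - 1;
    }

    // Code-point-aware step back: moves by 1 or 2 char values,
    // depending on the code point that was just read.
    // (Hypothetical signature; the real method may differ.)
    static int decrIdx(int idx, int codePoint) {
        return idx - Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String url = "/\uD83C\uDF08"; // "/" followed by U+1F308 as a surrogate pair

        int idx = 1;                   // position just after the slash
        int c = url.codePointAt(idx);  // reads the whole pair: 0x1F308
        idx += Character.charCount(c); // advance past both char values: idx == 3

        int naive = naiveStepBack(idx); // 2: lands in the middle of the pair
        int aware = decrIdx(idx, c);    // 1: lands at the start of the pair

        // Re-reading from the naive index yields a lone low surrogate.
        System.out.printf("naive: idx=%d, U+%04X%n", naive, url.codePointAt(naive));
        // Re-reading from the aware index yields the original code point.
        System.out.printf("aware: idx=%d, U+%04X%n", aware, url.codePointAt(aware));
    }
}
```

With the naive step, the next read starts on the second half of the surrogate pair, which is not a valid character on its own. That is consistent with the validator complaining about an illegal character in the path segment.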

sideshowbarker answered Feb 14 '26 13:02
