I have some URLs that contain special characters. For example:
http://www.example.com/bléèàû.html
If you type this URL into a browser, my web server serves the correct page (it can handle special characters).
I have looked at the Sitemaps spec and it's not clear whether a sitemap file can contain special characters. From my understanding of the protocol, if the URL works, the server serves the correct page, and the XML file is UTF-8 encoded, then it's fine.
For example, this would be a valid sitemap entry:
<url>
  <loc>http://www.example.com/bléèàû.html</loc>
  <changefreq>weekly</changefreq>
</url>
Can anyone confirm this?
[Update] The reason I'm reluctant to encode the special characters is that I don't want to introduce duplicate URLs for the same content. For example,
http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html
and
http://www.example.com/bléèàû.html
would serve the same page. I presume Google would catch both URLs, one via its normal indexing and one via the sitemap. Unfortunately, Google has a tendency to downgrade the page rank of sites with duplicate URLs pointing to the same page.
The Sitemaps specification doesn't say. It shows examples of URLs in various escaped forms but never definitively says whether raw non-ASCII characters are allowed. It only calls them 'URLs', with no reference to a particular definition of 'URL' or an RFC that would clarify whether it means old-school ASCII URIs or IRIs (which may contain non-ASCII characters).
So it would be safest to %-escape the UTF-8 encoding of the URL. The link will then work globally, and should be presented to the user as a Unicode character in all modern browsers.
<loc>http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html</loc>
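As a sketch of how to produce that escaped form programmatically, Python's standard `urllib.parse.quote` percent-encodes the UTF-8 bytes of a path (the example path is the one from the question):

```python
# Percent-encode the UTF-8 bytes of a non-ASCII URL path for a sitemap.
# urllib.parse.quote encodes to UTF-8 by default and leaves "/" unescaped.
from urllib.parse import quote, unquote

path = "/bléèàû.html"
escaped = quote(path)
print("http://www.example.com" + escaped)
# -> http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html

# The escaped form round-trips back to the original characters:
assert unquote(escaped) == path
```

Since both forms resolve to the same resource on the server, escaping in the sitemap doesn't create a new URL, only a different spelling of the same one.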