
Should I encode special characters in my sitemaps?

Tags:

sitemap

I have some URLs that contain special characters. For example:

http://www.example.com/bléèàû.html

If you type this URL into a browser, my web server shows the correct page (it can handle special characters).

I have looked at the sitemaps spec, and it's not clear whether a sitemap file can contain special characters. From what I understand of the protocol, if the URL works, the server serves the correct page, and the XML file is UTF-8 encoded, then it's OK.

For example, I believe this is a valid sitemap entry:

   <url>
      <loc>http://www.example.com/bléèàû.html</loc>
      <changefreq>weekly</changefreq>
   </url>

Can anyone confirm this?

[Update] The reason I'm reluctant to encode the special characters is that I don't want to introduce duplicate URLs for the same content. For example

http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html

and

http://www.example.com/bléèàû.html

would serve the same page. I presume Google would pick up both URLs through its normal indexing and the sitemap. Unfortunately, Google has a tendency to downgrade the page rank of sites that have duplicate URLs pointing to the same page.

Thierry-Dimitri Roy asked Jan 23 '23 20:01



1 Answer

The sitemaps specification doesn't say. It shows examples of URLs in various escaped forms but does not definitively say whether the first example (raw characters) is allowable. It only calls them ‘URL’s, with no reference to a particular definition of ‘URL’ or RFC which would clarify whether they mean old-school ASCII URIs, or IRIs (which may contain non-ASCII characters).

So it would be safest to %-escape the UTF-8 encoding of the URL. The link will then work globally, and all modern browsers should still present it to the user with the Unicode characters.

<loc>http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html</loc>
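If you generate the sitemap programmatically, the escaping can be done when each entry is written. Here is a minimal Python sketch, assuming only the path contains non-ASCII characters; the helper name `escape_url_for_sitemap` is made up for illustration, and a non-ASCII host name would need separate (IDNA) handling that this does not attempt.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url_for_sitemap(url: str) -> str:
    """Percent-escape the UTF-8 bytes of non-ASCII characters in the URL path.

    Hypothetical helper for illustration: the scheme, host, query and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # quote() encodes as UTF-8 by default. "/" is kept as the path separator,
    # and "%" is kept so an already-escaped URL is not escaped a second time.
    escaped_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, escaped_path, parts.query, parts.fragment)
    )


print(escape_url_for_sitemap("http://www.example.com/bléèàû.html"))
# → http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html
```

Running every URL through one function like this also means the sitemap only ever lists a single form of each address, which helps with the duplicate-URL concern in the question.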
bobince answered Jan 26 '23 11:01
