Canonicalize URL to lowercase without breaking file system or culture?

Tags: c#, asp.net, winapi

Canonicalizing URLs to Lowercase

I want to write an HTTP module that converts URLs to lowercase. My first attempt ignores international character sets and works well for plain ASCII paths:

// Convert URL virtual path to lowercase
string lowercase = context.Request.FilePath.ToLowerInvariant();

// If anything changed then issue 301 Permanent Redirect
if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
{
    context.Response.RedirectPermanent(...lowercase URL...);
}

The Turkey Test (international cultures):

But what about cultures other than en-US? I referred to the Turkey Test to come up with a test URL:

http://example.com/Iıİi

This little insidious gem destroys any notion that case conversion in URLs is simple! Under Turkish casing rules, its lowercase and uppercase versions, respectively, are:

http://example.com/ııii
http://example.com/IIİİ
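The two conversions above can be reproduced with a small console sketch. It assumes the runtime has Turkish (tr-TR) culture data available; the class and method names are hypothetical, and the test string is written with Unicode escapes so the source file encoding cannot alter it.

```csharp
using System;
using System.Globalization;

public static class TurkeyTest
{
    // Assumption: tr-TR culture data is installed on the machine running this.
    private static readonly CultureInfo Turkish = CultureInfo.GetCultureInfo("tr-TR");

    // Lowercase using Turkish rules: 'I' becomes dotless 'ı', 'İ' becomes 'i'.
    public static string Lower(string s)
    {
        return s.ToLower(Turkish);
    }

    // Uppercase using Turkish rules: 'ı' becomes 'I', 'i' becomes dotted 'İ'.
    public static string Upper(string s)
    {
        return s.ToUpper(Turkish);
    }

    public static void Main()
    {
        string segment = "I\u0131\u0130i"; // the path segment "Iıİi"

        // Compare against the lowercase and uppercase forms quoted in the post.
        Console.WriteLine(Lower(segment) == "\u0131\u0131ii");
        Console.WriteLine(Upper(segment) == "II\u0130\u0130");
    }
}
```

Passing the culture explicitly keeps the sketch independent of the globalization setting in web.config.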

For case conversion to work with Turkish URLs, I first had to set the current culture of ASP.NET to Turkish:

<system.web>
    <globalization culture="tr-TR" />
</system.web>

Next, I had to change my code to use the current culture for the case conversion:

// Convert URL virtual path to lowercase
string lowercase = context.Request.FilePath.ToLower(CultureInfo.CurrentCulture);

// If anything changed then issue 301 Permanent Redirect
if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
{
    context.Response.RedirectPermanent(...);
}

But wait! Will StringComparison.Ordinal still work? Or should I use StringComparison.CurrentCulture? I'm really not certain of either!
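One way to reason about this: StringComparison.Ordinal compares the UTF-16 code units of the two strings and takes no culture into account, so it reports a difference whenever the case conversion changed at least one character, whichever culture performed the conversion. The sketch below illustrates that under the assumption that tr-TR culture data is available; CaseChanged is a hypothetical helper name.

```csharp
using System;
using System.Globalization;

public static class ChangeDetection
{
    // True when lowercasing with the given culture altered at least one
    // UTF-16 code unit of the path. The ordinal comparison is culture-independent.
    public static bool CaseChanged(string path, CultureInfo culture)
    {
        string lowercase = path.ToLower(culture);
        return !string.Equals(lowercase, path, StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Assumption: tr-TR culture data is installed.
        CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");

        // "/Iıİi" contains uppercase letters, so a redirect would be issued.
        Console.WriteLine(CaseChanged("/I\u0131\u0130i", turkish));

        // "/ııii" is already lowercase in Turkish, so nothing changes.
        Console.WriteLine(CaseChanged("/\u0131\u0131ii", turkish));
    }
}
```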

File names: It gets MUCH WORSE!

Even if the above works, using the current culture for case conversions breaks file lookups on the NTFS file system! Let's say I have a static file named Iıİi.html:

http://example.com/Iıİi.html

Even though the Windows file system is case-insensitive, it does not apply culture-specific casing rules. Converting the above URL to lowercase with the Turkish culture results in a 404 Not Found because the file system doesn't consider the two names equal:

http://example.com/ııii.html

The correct case conversion for file names? WHO KNOWS?!

The MSDN article, Best Practices for Using Strings in the .NET Framework, has a note (about halfway through the article):

Note: The string behavior of the file system, registry keys and values, and environment variables is best represented by StringComparison.OrdinalIgnoreCase.

Huh? Best represented??? Is that the best we can do in C#? So just what is the correct case conversion to match the file system? Who knows?!!? About all we can say is that string comparisons using the above will probably work MOST of the time.
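As a rough illustration of what that note implies, the sketch below compares names with StringComparison.OrdinalIgnoreCase. It only models the comparison the MSDN note recommends; it does not query NTFS itself, and SameName is a hypothetical helper name.

```csharp
using System;

public static class FileNameComparison
{
    // Approximates file-system name matching with the comparison that the
    // MSDN note recommends. This is a model, not a call into the file system.
    public static bool SameName(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Plain ASCII names that differ only by case are treated as equal.
        Console.WriteLine(SameName("Index.HTML", "index.html"));

        // "Iıİi.html" versus its Turkish lowercase "ııii.html": the dotted
        // capital 'İ' is not matched with 'i', so the names differ.
        Console.WriteLine(SameName("I\u0131\u0130i.html", "\u0131\u0131ii.html"));
    }
}
```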

Summary: Two case conversions: Static/Dynamic URLs

  1. So we've seen that static URLs (URLs having a file path that matches a real directory/file in the file system) must use an unknown case conversion that is only "best represented" by StringComparison.OrdinalIgnoreCase. And please note there is no string.ToLowerOrdinal() method, so it's very difficult to know exactly what case conversion equates to the OrdinalIgnoreCase string comparison. Using string.ToLowerInvariant() is probably the best bet, yet it breaks language culture.
  2. On the other hand, dynamic URLs (URLs with a file path that does not match a real file on the disk and that map to your application) can use string.ToLower(CultureInfo.CurrentCulture), but it breaks file system matching and it is somewhat unclear what edge cases exist that may break this strategy.

Thus, it appears case conversion first requires detecting whether a URL is static or dynamic before choosing one of two conversion methods. For static URLs it is uncertain how to change case without breaking the Windows file system. For dynamic URLs it is questionable whether case conversion using culture will similarly break the URL.
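A minimal sketch of that two-branch idea follows. Everything in it is an assumption for illustration: the Canonicalize name, the isStaticFile callback (in a real module this might check the mapped physical path), and the choice of ToLowerInvariant() for static paths, which is only the "best bet" described above and is not verified against the file system.

```csharp
using System;
using System.Globalization;

public static class TwoWayCanonicalizer
{
    // isStaticFile is a hypothetical callback that decides whether the
    // virtual path maps to a real file or directory on disk.
    public static string Canonicalize(
        string virtualPath,
        Func<string, bool> isStaticFile,
        CultureInfo culture)
    {
        if (isStaticFile(virtualPath))
        {
            // Static URL: avoid culture-specific rules (unverified best bet).
            return virtualPath.ToLowerInvariant();
        }

        // Dynamic URL: use the application's culture.
        return virtualPath.ToLower(culture);
    }

    public static void Main()
    {
        // Assumption: tr-TR culture data is installed.
        CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");

        // Stand-in detection rule for the demo: treat ".html" paths as static.
        Func<string, bool> isStatic =
            p => p.EndsWith(".html", StringComparison.OrdinalIgnoreCase);

        Console.WriteLine(Canonicalize("/About.HTML", isStatic, turkish));
        Console.WriteLine(Canonicalize("/I\u0131\u0130i", isStatic, turkish));
    }
}
```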

Whew! Anyone have a solution to this mess? Or should I just close my eyes and pretend everything is ASCII?

Kevin P. Rice asked Jan 24 '12 08:01
1 Answer

I would challenge the premise here that there is any utility whatsoever in attempting to auto-convert URLs to lowercase.

Whether a full URL is case-sensitive or not depends entirely on the web server, web application framework, and underlying file system.

You're only guaranteed case-insensitivity in the scheme (http://, etc.) and hostname portions of the URL. And remember that not all URL schemes (file and news, for example) even include a hostname.

Everything else can be case-sensitive to the server, including paths (/), filenames, queries (?), fragments (#), and authority info (usernames/passwords before the @ in mailto, http, ftp, and some other schemes).
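If only the guaranteed case-insensitive parts should be normalized, a sketch along these lines may be enough. It relies on System.Uri, which reports the scheme and host in lowercase while leaving the case of the path, query, and fragment alone; LowercaseSchemeAndHost is a hypothetical helper name, and Uri may apply other canonicalization (such as escaping) that this sketch does not try to control.

```csharp
using System;

public static class SchemeHostOnly
{
    // Rebuilds the URL from the lowercase scheme and authority that Uri
    // exposes, followed by the path, query, and fragment with case intact.
    public static string LowercaseSchemeAndHost(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);
        return uri.GetLeftPart(UriPartial.Authority) + uri.PathAndQuery + uri.Fragment;
    }

    public static void Main()
    {
        Console.WriteLine(
            LowercaseSchemeAndHost("HTTP://Example.COM/Docs/ReadMe.html?Id=AbC#Top"));
    }
}
```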

richardtallent answered Nov 15 '22 08:11