I would like to write a C# method that would transform any title into a URL friendly string, similar to what Stack Overflow does:
I'm thinking of removing Reserved characters as per RFC 3986 standard (from Wikipedia) but I don't know if that would be enough? It would make links workable, but does anyone know what other characters are being replaced here at stackoverflow? I don't want to end up with %-s in my URLs...
string result = Regex.Replace(value.Trim(), @"[!*'""`();:@&+=$,/\\?%#\[\]<>«»{}_]");
return Regex.Replace(result.Trim(), @"[\s*[\-–—\s]\s*]", "-");
Rather than looking for things to replace, the list of unreserved chars is so short, it'll make for a nice clear regex.
return Regex.Replace(value, @"[^A-Za-z0-9_\.~]+", "-");
(Note that I didn't include the dash in the list of allowed chars; that's so it gets gobbled up by the "1 or more" operator [+
] so that multiple dashes (in the original or generated or a combination) are collapsed, as per Dominic Rodger's excellent point.)
You may also want to remove common words ("the", "an", "a", etc.), although doing so can slightly change the meaning of a sentence. Probably want to remove any trailing dashes and periods as well.
Also strongly recommend you do what SO and others do, and include a unique identifier other than the title, and then only use that unique ID when processing the URL. So http://example.com/articles/1234567/is-the-pop-catholic
(note the missing 'e') and http://example.com/articles/1234567/is-the-pope-catholic
resolve to the same resource.
I would be doing:
string url = title;
url = Regex.Replace(url, @"^\W+|\W+$", "");
url = Regex.Replace(url, @"'\"", "");
url = Regex.Replace(url, @"_", "-");
url = Regex.Replace(url, @"\W+", "-");
Basically what this is doing is it:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With