
Transform title into dashed URL-friendly string [closed]

I would like to write a C# method that would transform any title into a URL friendly string, similar to what Stack Overflow does:

  • replace spaces with dashes
  • remove parentheses
  • etc.

I'm thinking of removing the reserved characters defined in RFC 3986 (from Wikipedia), but I don't know whether that would be enough. It would make the links work, but does anyone know which other characters are replaced here on Stack Overflow? I don't want to end up with %-s in my URLs...

Current implementation

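// Remove reserved and otherwise unwanted characters.
string result = Regex.Replace(value.Trim(), @"[!*'""`();:@&+=$,/\\?%#\[\]<>«»{}_]", "");
// Replace a dash or whitespace character, plus any surrounding whitespace, with a single dash.
return Regex.Replace(result.Trim(), @"\s*[\-–—\s]\s*", "-");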

My questions

  1. Which characters should I remove?
  2. Should I limit the maximum length of resulting string?
  3. Anyone know which rules are applied on titles here on SO?
Robert Koritnik asked Jan 29 '10 11:01


2 Answers

Rather than looking for things to replace, work from the other direction: the list of unreserved characters is so short that it makes for a nice, clear regex.

return Regex.Replace(value, @"[^A-Za-z0-9_\.~]+", "-");

(Note that I didn't include the dash in the list of allowed chars; that's so it gets gobbled up by the "1 or more" operator [+] so that multiple dashes (in the original or generated or a combination) are collapsed, as per Dominic Rodger's excellent point.)

You may also want to remove common words ("the", "an", "a", etc.), although doing so can slightly change the meaning of a sentence. You'll probably want to remove any trailing dashes and periods as well.
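
As a rough sketch of those two follow-up steps, the helper below applies the regex above, optionally drops a few common words, and then trims dashes and periods from both ends. The class name, method name, `dropStopWords` parameter and the stop-word list are all made up for illustration; which words to drop is a judgment call.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Hypothetical stop-word list, for illustration only.
    private static readonly string[] StopWords = { "the", "an", "a" };

    public static string ToSlug(string title, bool dropStopWords = false)
    {
        // Same pattern as above: every run of characters outside the
        // allowed set (dashes included) becomes a single dash.
        string slug = Regex.Replace(title, @"[^A-Za-z0-9_\.~]+", "-");

        if (dropStopWords)
        {
            // Split on dashes, skip empty segments and stop words
            // (compared case-insensitively), then rejoin.
            var kept = slug.Split('-')
                .Where(w => w.Length > 0 &&
                            !StopWords.Contains(w, StringComparer.OrdinalIgnoreCase));
            slug = string.Join("-", kept);
        }

        // Remove leading and trailing dashes and periods.
        return slug.Trim('-', '.');
    }
}
```

With this sketch, `SlugSketch.ToSlug("Is the Pope (really) Catholic?")` gives `Is-the-Pope-really-Catholic`, and passing `dropStopWords: true` gives `Is-Pope-really-Catholic`.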

I'd also strongly recommend doing what SO and others do: include a unique identifier in the URL in addition to the title, and use only that ID when processing the URL. That way http://example.com/articles/1234567/is-the-pop-catholic (note the missing 'e') and http://example.com/articles/1234567/is-the-pope-catholic resolve to the same resource.
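
A minimal sketch of that idea, assuming URLs shaped like `/articles/{numeric id}/{optional slug}` as in the example above: only the ID is extracted, and whatever follows it is ignored. `ArticleUrlSketch` and `TryGetArticleId` are hypothetical names, not part of any framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArticleUrlSketch
{
    // Assumed URL shape: /articles/{numeric id}/{optional slug}
    private static readonly Regex PathPattern =
        new Regex(@"^/articles/(?<id>\d+)(?:/[^/]*)?/?$");

    // Returns true and sets id when the path matches; the slug is never inspected.
    public static bool TryGetArticleId(string path, out long id)
    {
        id = 0;
        Match m = PathPattern.Match(path);
        return m.Success && long.TryParse(m.Groups["id"].Value, out id);
    }
}
```

Both `/articles/1234567/is-the-pop-catholic` and `/articles/1234567/is-the-pope-catholic` yield the ID 1234567 here, so a misspelled or outdated title in a link does not break it.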

T.J. Crowder answered Oct 25 '22 11:10


I would be doing:

string url = title;
url = Regex.Replace(url, @"^\W+|\W+$", "");
url = Regex.Replace(url, @"['""]", "");
url = Regex.Replace(url, @"_", "-");
url = Regex.Replace(url, @"\W+", "-");

Basically, this:

  • strips non-word characters from the beginning and end of the title;
  • removes single and double quotes (mainly to get rid of apostrophes in the middle of words);
  • replaces underscores with hyphens (underscores are technically a word character along with digits and letters); and
  • replaces all groups of non-word characters with a single hyphen.
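
For experimenting, the four steps can be wrapped in a method like the sketch below. `WordSlugSketch` and `ToSlug` are placeholder names, and the quote-removal step is written as a character class inside a verbatim string so that it matches either kind of quote.

```csharp
using System.Text.RegularExpressions;

public static class WordSlugSketch
{
    public static string ToSlug(string title)
    {
        string url = title;
        url = Regex.Replace(url, @"^\W+|\W+$", ""); // strip non-word characters at both ends
        url = Regex.Replace(url, @"['""]", "");     // drop single and double quotes
        url = Regex.Replace(url, @"_", "-");        // underscores become hyphens
        url = Regex.Replace(url, @"\W+", "-");      // each run of non-word characters becomes one hyphen
        return url;
    }
}
```

For the input `  Don't "quote" me_on this!  ` this returns `Dont-quote-me-on-this`. Two things to keep in mind: a leading or trailing underscore survives the first step and ends up as a hyphen (`_foo` becomes `-foo`), and in .NET `\W` is Unicode-aware by default, so accented letters are kept rather than replaced unless you pass `RegexOptions.ECMAScript`.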
cletus answered Oct 25 '22 11:10