
Wikipedia (MediaWiki) URI encoding scheme

How do Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores and double quotes are not encoded and things like that.

asked Oct 06 '10 05:10 by parsa


2 Answers

http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - this page describes the restrictions their engine enforces on article names.

They should have something like this in their LocalSettings.php: $wgArticlePath = '/wiki/$1';

and a matching URI rewrite configuration on the server. They seem to be using Apache (judging by the HTTP headers), so it's probably mod_rewrite. See http://www.mediawiki.org/wiki/Manual:Short_URL

You can also reach an article on Wikipedia through index.php, like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar. The engine then redirects you to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes, mod_rewrite translates that path into /index.php?title=Foo_bar. For the MediaWiki engine this is the same as visiting http://en.wikipedia.org/w/index.php?title=Foo_bar directly, and that page doesn't redirect you.
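To make the title-to-URL mapping concrete, here is a rough Python sketch. It is not MediaWiki's code: title_to_path and SAFE_CHARS are names made up for this example, and the set of characters left unencoded is an assumption about what the engine keeps readable, so check it against the MediaWiki source before relying on it.

```python
from urllib.parse import quote

# Mirrors the $wgArticlePath setting quoted above.
ARTICLE_PATH = "/wiki/$1"

# Assumption: punctuation left unencoded so URLs stay readable.
# Verify against the MediaWiki source before relying on this set.
SAFE_CHARS = ";@$!*(),/~:"


def title_to_path(title: str) -> str:
    """Build a short-URL path from a page title (illustrative only)."""
    # Runs of whitespace become a single underscore.
    db_key = "_".join(title.split())
    # Percent-encode everything outside the safe set (UTF-8 by default).
    return ARTICLE_PATH.replace("$1", quote(db_key, safe=SAFE_CHARS))


print(title_to_path("Foo bar"))
```

With this safe set, double quotes still come out as %22 when the link is generated this way.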

answered Nov 12 '22 11:11 by Zygmunt


The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.

Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.

The logic looks something like this:

  • Decode character references (e.g. &eacute; for é)
  • Convert spaces to underscores
  • Check whether the title is a reference to a namespace or interwiki
  • Remove hash fragments (e.g. Apple#Name)
  • Remove forbidden characters
  • Forbid subdirectory links (e.g. ../directory/page)
  • Forbid triple tilde sequences (~~~) (for some reason)
  • Limit the size to 255 bytes
  • Capitalise the first letter
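
The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not a port of secureAndSplit: normalize_title and the FORBIDDEN pattern are hypothetical, the namespace/interwiki check is skipped because it depends on site configuration, and the sketch rejects titles with forbidden characters instead of stripping them.

```python
import html
import re

MAX_TITLE_BYTES = 255

# Assumption: an approximation of the characters not allowed in titles.
FORBIDDEN = re.compile(r"[<>\[\]{}|]")


def normalize_title(text: str) -> str:
    """Normalise a page title along the lines of the list above (sketch)."""
    # Decode character references such as &eacute;
    title = html.unescape(text)
    # Drop any hash fragment (Apple#Name -> Apple)
    title = title.split("#", 1)[0]
    # Spaces to underscores; collapse runs and trim the ends
    title = re.sub(r"[ _]+", "_", title).strip("_")
    if not title:
        raise ValueError("empty title")
    if FORBIDDEN.search(title):
        raise ValueError("forbidden character in title")
    # Forbid relative path segments such as ../directory/page
    if any(segment in (".", "..") for segment in title.split("/")):
        raise ValueError("relative path in title")
    # Forbid triple tilde sequences
    if "~~~" in title:
        raise ValueError("tilde sequence in title")
    # The limit is in bytes, not characters
    if len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        raise ValueError("title too long")
    # Capitalise the first letter only
    return title[0].upper() + title[1:]


print(normalize_title("foo  bar#section"))
```

Invalid input raises ValueError here; that is a design choice for the sketch, not something taken from the Title class.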

Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.

I hope that helps!

answered Nov 12 '22 11:11 by lonesomeday