Question:
I want to render MediaWiki syntax (and I mean MediaWiki syntax as used by WikiPedia, not some other wiki format from some other engine such as WikiPlex), and that in C#.
Input: MediaWiki Markup string
Output: HTML string
There are some alternative mediawiki parsers, but nothing in C#, and additionally pinvoking C/C++ looks bleak, because of the structure of those libaries.
As syntax guidance, I use http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet
My first goal is to render that page's markup correctly.
Markup can be seen here: http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit
Now, if I use Regex, it's not of much use, because one can't exactly say which tag ends which starting ones, especially when some elements, such as italic, become an attribute of the parent element.
On the other hand, parsing character by character is not a good approach either, because for example ''' means bold, '' means italic, and ''''' means bold and italic...
I looked into porting some of the other parsers' code, but the java implementations are obscure, and the Python implementations have have a very different regex syntax.
The best approach I see so far would be to port mwlib to IronPython http://www.mediawiki.org/wiki/Alternative_parsers
But frankly, I'm not looking forward to having the IronPython runtime added as a dependency to my application, and even if I would want to, the documentation is bad at best.
Update per 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki-renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(NetStandard 2.0)
Since Parsoid is GPL 2.0, and and the GPL-code is invoked in nodejs in a separate process via network, you can even use any license you like ;)
Problem solved.
As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.
First attempt was pinvoke kiwi. It worked, but but failed because:
Second attempt was mwlib. That failed because somehow IronPython doesn't work as it should.
Third attempt was Swebele, which essentially turned out to be academic vapoware.
The fourth attempt was using the original mediawiki renderer, using Phalanger. That failed because the MediaWiki renderer is not really modular.
The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow and Wiky.php doesn't very completely implement MediaWiki.
The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of 3rd party libraries ==> it compiles, but yields null-reference exceptions only
The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.
The 8th attempt was writing an own "parser" via Regex.
But the time required to make it work is just excessive, so I stopped.
The 9th attempt was successful. Using ikvmc on WikiModel yields a useful dll. The problem there was the example-code was hoplessly out of date. But using google and the WikiModel sourcecode, I was able to piece it together.
The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser
Why shouldn't this be possible with regular expressions?
inputString = Regex.Replace(inputString, @"(?:'''''')(.*?)(?:'''''')", @"<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", @"<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", @"<em>$1</em>");
This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
Here is how I once implemented a solution:
Dictionary<char, List<RegEx>>
The char is the first (Markup) character in each RegEx, and RegEx's must be sorted by Markup keyword length desc, e.g. ===
before ==
.
Iterate through the characters of the input string, and check if Dictionary.ContainsKey(char). If it does, search the List for matching RegEx. First matching RegEx wins.
As MediaWiki allows recursive markup (except for <pre> and others), the string inside the markup must also be processed in this fashion recursively.
If there is a match, skip ahead the number of characters matching the RegEx in input string. Otherwise proceed to next character.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With