Best approach to render MediaWiki in C#?

Question:

I want to render MediaWiki syntax in C#. I mean MediaWiki syntax as used by Wikipedia, not the wiki format of some other engine such as WikiPlex.

Input: MediaWiki Markup string
Output: HTML string

There are some alternative MediaWiki parsers, but none in C#, and P/Invoking the C/C++ ones looks bleak because of the structure of those libraries.

As syntax guidance, I use http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet

My first goal is to render that page's markup correctly.

Markup can be seen here: http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit

Now, if I use Regex, it's not of much use, because one can't say exactly which closing tag belongs to which opening tag, especially when some elements, such as italic, become an attribute of the parent element.

On the other hand, parsing character by character is not a good approach either, because for example ''' means bold, '' means italic, and ''''' means bold and italic...
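
To illustrate the ambiguity: a character-level scanner has to read the whole run of apostrophes before it can decide what the run means. The following is a minimal sketch of that idea, not MediaWiki's actual algorithm. The class name QuoteRuns and the token labels are made up for illustration, and run lengths other than 2, 3 and 5 are simply left as literal text, which is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteRuns
{
    // Maps the length of an apostrophe run to a (hypothetical) token label.
    // Returns null for lengths this sketch does not treat as markup.
    private static string Classify(int runLength)
    {
        if (runLength == 2) return "ITALIC";
        if (runLength == 3) return "BOLD";
        if (runLength == 5) return "BOLD_ITALIC";
        return null;
    }

    // Splits the input into TEXT:... tokens and quote tokens by measuring
    // each complete run of apostrophes before deciding what it means.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        int textStart = 0; // start of the pending literal text
        while (i < input.Length)
        {
            if (input[i] != '\'') { i++; continue; }

            int runStart = i;
            while (i < input.Length && input[i] == '\'') i++;

            string kind = Classify(i - runStart);
            if (kind == null) continue; // e.g. the single apostrophe in "it's"

            if (runStart > textStart)
                tokens.Add("TEXT:" + input.Substring(textStart, runStart - textStart));
            tokens.Add(kind);
            textStart = i;
        }
        if (textStart < input.Length)
            tokens.Add("TEXT:" + input.Substring(textStart));
        return tokens;
    }
}
```

For the input a '''b''' it's ''c'' this produces a text token, BOLD, a text token, BOLD, a text token containing "it's", ITALIC, a text token, ITALIC. Pairing the tokens into properly nested HTML is the part that remains hard.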

I looked into porting some of the other parsers' code, but the Java implementations are obscure, and the Python implementations have a very different regex syntax.

The best approach I see so far would be to port mwlib (listed at http://www.mediawiki.org/wiki/Alternative_parsers) to IronPython.

But frankly, I'm not looking forward to adding the IronPython runtime as a dependency of my application, and even if I wanted to, the documentation is bad at best.

Stefan Steiger · Accepted Answer

Update as of 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki-renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(.NET Standard 2.0) Since Parsoid is GPL 2.0, and the GPL code is invoked in Node.js in a separate process via the network, you can even use any license you like ;)

Pre-2017

Problem solved. As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.

The first attempt was to P/Invoke kiwi. It worked, but turned out to be unusable because:

kiwi uses char* (fails on anything non-English/non-ASCII)
it is not thread-safe
it requires a native DLL in the code for every architecture (I added x86 and amd64, then it went kaboom on my ARM processor)

The second attempt was mwlib. That failed because somehow IronPython doesn't work as it should.

The third attempt was Sweble, which essentially turned out to be academic vaporware.

The fourth attempt was using the original mediawiki renderer, using Phalanger. That failed because the MediaWiki renderer is not really modular.

The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php implements MediaWiki syntax only very incompletely.

The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of third-party libraries: it compiles, but yields nothing but null-reference exceptions.

The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.

The eighth attempt was writing my own "parser" via Regex.
But the time required to make it work was just excessive, so I stopped.

The ninth attempt was successful. Using ikvmc on WikiModel yields a usable DLL. The problem there was that the example code was hopelessly out of date. But using Google and the WikiModel source code, I was able to piece it together.

The end result can be found here:
https://github.com/ststeiger/MultiWikiParser

Maarten van der Lee · Answer

Why shouldn't this be possible with regular expressions?

inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", @"<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", @"<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", @"<em>$1</em>");

This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
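
As a self-contained sketch of this approach, the three replacements can be wrapped in a method and tried on sample input. The class name QuoteRegexDemo is made up for illustration. The sketch uses five apostrophes for bold-italic, since that is the delimiter MediaWiki uses, and it applies the longest delimiter first so that shorter patterns do not consume parts of longer ones.

```csharp
using System.Text.RegularExpressions;

public static class QuoteRegexDemo
{
    // Applies the replacements from the longest delimiter to the shortest.
    // All patterns are non-greedy, so each match ends at the nearest closer.
    public static string Render(string input)
    {
        input = Regex.Replace(input, @"'''''(.*?)'''''", "<strong><em>$1</em></strong>");
        input = Regex.Replace(input, @"'''(.*?)'''", "<strong>$1</strong>");
        input = Regex.Replace(input, @"''(.*?)''", "<em>$1</em>");
        return input;
    }
}
```

Cleanly separated or properly nested markup comes out as expected: '''bold ''both'' bold''' becomes <strong>bold <em>both</em> bold</strong>. Overlapping markup does not: ''italic '''both'' bold''' becomes <em>italic <strong>both</em> bold</strong>, which is mis-nested HTML. That is the limitation the question describes.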

devio · Answer

Here is how I once implemented a solution:

define your regular expressions for Markup->HTML conversion
regular expressions must be non-greedy
collect the regular expressions in a Dictionary<char, List<Regex>>

The char is the first (Markup) character of each Regex, and each List must be sorted by Markup keyword length, descending, e.g. === before ==.

Iterate through the characters of the input string, and check if Dictionary.ContainsKey(char). If it does, search the List for a matching Regex. The first matching Regex wins.

As MediaWiki allows recursive markup (except for <pre> and others), the string inside the markup must also be processed in this fashion recursively.

If there is a match, skip ahead by the number of characters the Regex matched in the input string. Otherwise proceed to the next character.
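
The steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the original implementation: the names MiniWikiRenderer and Rule are made up, only quote and heading markup is covered, each pattern is anchored at the current scan position with \G, headings are not restricted to the start of a line, and no HTML escaping is done.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class MiniWikiRenderer
{
    // One markup rule: a non-greedy pattern anchored at the scan position (\G)
    // plus the HTML that wraps its first capture group.
    private sealed class Rule
    {
        public readonly Regex Pattern;
        public readonly string Open;
        public readonly string Close;

        public Rule(string pattern, string open, string close)
        {
            Pattern = new Regex(@"\G" + pattern);
            Open = open;
            Close = close;
        }
    }

    // Keyed by the first markup character; each list is sorted by
    // delimiter length, descending (=== before ==).
    private static readonly Dictionary<char, List<Rule>> Rules =
        new Dictionary<char, List<Rule>>
        {
            { '\'', new List<Rule> {
                new Rule(@"'''''(.+?)'''''", "<strong><em>", "</em></strong>"),
                new Rule(@"'''(.+?)'''", "<strong>", "</strong>"),
                new Rule(@"''(.+?)''", "<em>", "</em>") } },
            { '=', new List<Rule> {
                new Rule(@"===(.+?)===", "<h3>", "</h3>"),
                new Rule(@"==(.+?)==", "<h2>", "</h2>") } }
        };

    public static string Render(string input)
    {
        var output = new StringBuilder();
        int pos = 0;
        while (pos < input.Length)
        {
            Rule winner = null;
            Match match = null;
            List<Rule> candidates;
            if (Rules.TryGetValue(input[pos], out candidates))
            {
                foreach (Rule rule in candidates)
                {
                    Match m = rule.Pattern.Match(input, pos);
                    if (m.Success) { winner = rule; match = m; break; } // first match wins
                }
            }

            if (winner != null)
            {
                // Process the inner text recursively, then skip the whole match.
                output.Append(winner.Open)
                      .Append(Render(match.Groups[1].Value))
                      .Append(winner.Close);
                pos += match.Length;
            }
            else
            {
                output.Append(input[pos]); // no rule applies: copy the character
                pos++;
            }
        }
        return output.ToString();
    }
}
```

With these rules, MiniWikiRenderer.Render("==Title== '''bold ''both'' bold''' x") returns <h2>Title</h2> <strong>bold <em>both</em> bold</strong> x, and a lone apostrophe as in "it's" is copied through unchanged because no rule matches at that position.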

Tags: c#, parsing, asp.net, .net-core, mediawiki
