 

Where can I find a good MediaWiki Markup parser in PHP?

I would try hacking MediaWiki's code a little, but I figured out it would be unnecessary if I can get an independent parser.

Can anyone help me with this?

Thanks.

Aleph Dvorak asked Jun 22 '09



People also ask

What is MediaWiki syntax parser?

This is a parser for MediaWiki's (MW) syntax. Its goal is to transform wikitext into an abstract syntax tree (AST) and then render that AST into various formats such as plain text and HTML. Two files, preprocessor.pijnu and mediawiki.pijnu, describe the MW syntax using patterns that form a grammar.

Should we produce a specification of MediaWiki's markup format?

Produce a specification of MediaWiki's markup format that is sufficiently complete and consistent for future parser implementations to be built from it. Features that are currently either Not Possible or Very Hard (e.g. WYSIWYG editing) could also benefit from such a specification.

What is wikitext in MediaWiki?

The MediaWiki markup language (commonly referred to within the MediaWiki community as wikitext, though this usage is ambiguous within the larger wiki community) uses non-textual ASCII characters, sometimes in pairs, to indicate to the parser how the editor wishes an item or section of text to be displayed.

What is the best MediaWiki library in Ruby?

WikiCloth is another Ruby library for MediaWiki markup. Infoboxer is a Ruby MediaWiki client and parser, aiming (mostly successfully) to parse and navigate any page of any Wikimedia project, including Wikipedia.


2 Answers

Ben Hughes is right. It's very difficult to get right, especially if you want to parse real articles from big wikis like Wikipedia itself with 100% accuracy. The subject comes up frequently on the wikitech mailing list, and no alternative parser has come up with the goods despite many attempts.

Firstly, it's not really a parser, in that it has no concept of an AST (abstract syntax tree). It's a converter that specifically produces HTML.

Secondly, don't fall into the trap of thinking of wikitext as a markup language that can be extended on rare occasions with HTML. Think of it instead as an extension of HTML: it is much easier to add wikitext support to an HTML parser than to add HTML support to a wikitext parser.

What this boils down to is that if you want any other output format, you will need to convert from HTML to that format.
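To see why "converter, not parser" matters, here is a deliberately naive sketch that turns a tiny wikitext subset straight into HTML with regexes. The rules are my own illustration, not MediaWiki's code: there is no syntax tree, only text-to-HTML substitution, and every rule below breaks on nesting, stray apostrophes, templates, tables, and countless other edge cases — which is exactly why a complete reimplementation is so hard.

```php
<?php
// Naive wikitext-subset-to-HTML converter (illustrative only).
function naiveWikitextToHtml(string $text): string {
    // '''bold''' must be handled before ''italic'' so triple quotes win.
    $text = preg_replace("/'''(.+?)'''/s", '<b>$1</b>', $text);
    $text = preg_replace("/''(.+?)''/s", '<i>$1</i>', $text);
    // [[Page|label]] piped links first, then plain [[Page]] links.
    $text = preg_replace('/\[\[([^|\]]+)\|([^\]]+)\]\]/', '<a href="/wiki/$1">$2</a>', $text);
    $text = preg_replace('/\[\[([^\]]+)\]\]/', '<a href="/wiki/$1">$1</a>', $text);
    // == Heading == on a line of its own.
    $text = preg_replace('/^==\s*(.+?)\s*==$/m', '<h2>$1</h2>', $text);
    return $text;
}

echo naiveWikitextToHtml("== Intro ==\nSee '''[[PHP]]''' and ''[[Parser|the parser]]''.\n");
```

Note that the only sensible output of each rule is already HTML; to target anything else, you would post-process the HTML, just as described above.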

The common wisdom is that only MediaWiki itself can parse wikitext, and indeed the parser is tightly integrated with the rest of the code. Experienced MediaWiki hackers do not react well to questions about isolating the parser - I've tried (-:

But I've gone ahead and isolated it anyway. It's not complete or ready to share with anybody yet, but basically you want to start with the MediaWiki source, not installed and not connected to a database or web server. Write a PHP stub program that includes the parser and calls an entry point. When it fails to run, check the error and write a phony stub for whatever class, function, or global was accessed. Repeat until you have stubbed most of the places where the parser interacts with the rest of MediaWiki.
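A stub file grown that way might start out looking something like the sketch below. The symbol names (wfMsg, the wfProfileIn/wfProfileOut profiling hooks, the $wgContLang content-language global, and the includes/parser/Parser.php path) are real names from 2009-era MediaWiki, but the phony bodies and the exact set of stubs you need are assumptions — in practice you accumulate them one fatal error at a time.

```php
<?php
// Hypothetical stub file for running the MediaWiki parser in isolation.

// Message lookup: echo the key back so output stays traceable.
function wfMsg($key) { return "($key)"; }

// Profiling hooks: harmless no-ops outside a full install.
function wfProfileIn($fn) {}
function wfProfileOut($fn) {}

// Minimal stand-in for the content-language object the parser consults.
class StubLanguage {
    public function ucfirst($s) { return ucfirst($s); }
    public function lc($s) { return strtolower($s); }
}
$wgContLang = new StubLanguage();

// Entry point of the stub driver: pull in the parser sources from a
// MediaWiki checkout and call into them (commented out here, since it
// needs the real tree on disk).
// require_once 'mediawiki/includes/parser/Parser.php';
// $html = (new Parser())->parse($wikitext, $title, $options)->getText();
```

Each run that dies with "Call to undefined function ..." or "Class ... not found" tells you the next stub to add.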

The problem then is keeping your hacked, stubbed variant in sync: the source tree changes quickly, the live wikis pick up parser changes almost immediately, and your variant will have to keep up if it is to keep working into the future.

Check out my feature request: Bug 25984 - Isolate parser from database dependencies

hippietrail answered Sep 24 '22



It's actually an incredibly difficult format to parse. You can try to separate out the parser component from MediaWiki (it is also PHP), but it is a tangled mess. I've seen a few partial standalone parsers that do a nearly reasonable job for a very limited subset of the markup.

If you happen to implement one, or to refactor the current Wikipedia one, let me know, as it could be quite useful.

Ben Hughes answered Sep 23 '22
