Any ideas for a nice, configurable wikitext parser with an easy-to-use API? I'm looking to feed it data such as http://wikitravel.org/wiki/en/api.php?format=xml&action=parse&prop=wikitext&page=San%20Francisco, choose the sections of data I want, and output custom HTML for each type of element. Java would be preferred, but a PHP/JS solution that is compatible with most (99%+) wikitext would be okay as well.
Sweble is probably the best Java parser of wikitext. It claims to be 100% compliant with wikitext, but I seriously doubt that. It parses wikitext into an abstract syntax tree that you then have to do something with (like convert it to HTML).
There is a page on mediawiki.org that lists wikitext parsers in various programming languages. I don't think any of them do 99+% of wikitext though. In general parsing wikitext is a really complex problem. Wikitext isn't even formally defined anywhere outside of the MediaWiki parser itself.
This question was answered years ago, but I wanted to save future visitors the effort I had to take to figure out how to use Sweble.
You can try the documentation at their site, but I couldn't figure it out. Just look at the example source code. Download the source jar for swc-example-basic at https://repo1.maven.org/maven2/org/sweble/wikitext/swc-example-basic/2.0.0/swc-example-basic-2.0.0-sources.jar and look at App.java and TextConverter.java.
Basically, to parse a page and convert it to another form, you first add the following dependency to your project:
<dependency>
    <groupId>org.sweble.wikitext</groupId>
    <artifactId>swc-engine</artifactId>
    <version>2.0.0</version>
</dependency>
Then, do the following:
public String convertWikiText(String title, String wikiText, int maxLineLength)
        throws LinkTargetException, EngineException {
    // Set-up a simple wiki configuration
    WikiConfig config = DefaultConfigEnWp.generate();
    // Instantiate a compiler for wiki pages
    WtEngineImpl engine = new WtEngineImpl(config);
    // Retrieve a page
    PageTitle pageTitle = PageTitle.make(config, title);
    PageId pageId = new PageId(pageTitle, -1);
    // Compile the retrieved page
    EngProcessedPage cp = engine.postprocess(pageId, wikiText, null);
    TextConverter p = new TextConverter(config, maxLineLength);
    return (String) p.go(cp.getPage());
}
The TextConverter is a class you'll find in the examples I mentioned above. Customize it to do whatever you want. For example, the following makes sure all bold text is surrounded by "**":
public void visit(WtBold b)
{
    write("**");
    iterate(b);
    write("**");
}
There are a bunch of visit methods on that class for each type of element that you'll encounter.
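To make the "one visit method per element type" idea concrete for the HTML case the question asks about, here is a minimal, dependency-free sketch. Everything in it (Node, Text, Bold, Italics, Paragraph, HtmlConverter) is a hypothetical stand-in I made up for illustration, not a Sweble class. In the Sweble example the visitor base class routes each node to the matching visit method for you; the explicit instanceof chain below is only an assumption to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VisitorSketch {

    /** Hypothetical stand-in for a parsed wikitext node; not a Sweble type. */
    static abstract class Node {
        final List<Node> children = new ArrayList<>();
        Node(Node... kids) { children.addAll(Arrays.asList(kids)); }
    }

    static final class Text extends Node {
        final String content;
        Text(String content) { super(); this.content = content; }
    }

    static final class Bold extends Node { Bold(Node... kids) { super(kids); } }
    static final class Italics extends Node { Italics(Node... kids) { super(kids); } }
    static final class Paragraph extends Node { Paragraph(Node... kids) { super(kids); } }

    /** One visit method per element type, in the same shape as TextConverter. */
    static final class HtmlConverter {
        private final StringBuilder out = new StringBuilder();

        String go(Node root) {
            dispatch(root);
            return out.toString();
        }

        // Assumption: manual dispatch, standing in for the visitor base class.
        private void dispatch(Node n) {
            if (n instanceof Text) visit((Text) n);
            else if (n instanceof Bold) visit((Bold) n);
            else if (n instanceof Italics) visit((Italics) n);
            else if (n instanceof Paragraph) visit((Paragraph) n);
            else iterate(n); // unknown element type: just descend into children
        }

        private void iterate(Node n) {
            for (Node child : n.children) dispatch(child);
        }

        private void write(String s) { out.append(s); }

        // Escape the characters that are unsafe in HTML text content.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        void visit(Text t)      { write(escape(t.content)); }
        void visit(Bold b)      { write("<b>"); iterate(b); write("</b>"); }
        void visit(Italics i)   { write("<i>"); iterate(i); write("</i>"); }
        void visit(Paragraph p) { write("<p>"); iterate(p); write("</p>"); }
    }

    public static void main(String[] args) {
        Node page = new Paragraph(
                new Text("Visit "),
                new Bold(new Text("Alcatraz")),
                new Text(" & "),
                new Italics(new Text("Muir Woods")));
        // prints: <p>Visit <b>Alcatraz</b> &amp; <i>Muir Woods</i></p>
        System.out.println(new HtmlConverter().go(page));
    }
}
```

With the real library you would put the same kind of write/iterate/write bodies into your copy of TextConverter, one per node class you care about, and let any elements you want to skip fall through without output.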