
Java Wikitext Parser [closed]

Any ideas for a nice, configurable parser with an easy-to-use API? I'm looking to feed it data such as http://wikitravel.org/wiki/en/api.php?format=xml&action=parse&prop=wikitext&page=San%20Francisco, choose the sections of data I want, and output custom HTML for each type of element. Java would be preferred, but a PHP/JS solution that is compatible with most (99%+) wikitext would be okay as well.

No_name asked Jul 23 '12 12:07


2 Answers

Sweble is probably the best Java parser of wikitext. It claims to be 100% compliant with wikitext, but I seriously doubt that. It parses wikitext into an abstract syntax tree that you then have to do something with (like convert it to HTML).
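
To make "an abstract syntax tree that you then have to do something with" concrete for the question's use case (picking out a section and emitting HTML for it), here is a minimal sketch in plain Java. The `Section` class, `find`, and `toHtml` are hypothetical stand-ins invented for illustration; they are not Sweble's node types or API, and HTML escaping is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free model of a parsed page; NOT Sweble's classes.
class SectionSketch {
    // One AST node: a heading, its body text, and any nested subsections.
    static final class Section {
        final String heading;
        final String body;
        final List<Section> subsections = new ArrayList<>();

        Section(String heading, String body) {
            this.heading = heading;
            this.body = body;
        }

        Section add(Section child) {
            subsections.add(child);
            return this;
        }
    }

    // Depth-first search for the first section with the given heading, or null.
    static Section find(Section root, String heading) {
        if (root.heading.equals(heading)) {
            return root;
        }
        for (Section sub : root.subsections) {
            Section hit = find(sub, heading);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Render a section and its subsections, clamping heading levels at h6.
    static String toHtml(Section s, int level) {
        int h = Math.min(level, 6);
        StringBuilder sb = new StringBuilder();
        sb.append("<h").append(h).append(">").append(s.heading)
          .append("</h").append(h).append(">");
        sb.append("<p>").append(s.body).append("</p>");
        for (Section sub : s.subsections) {
            sb.append(toHtml(sub, level + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Section page = new Section("San Francisco", "Intro")
            .add(new Section("Get in", "By plane")
                .add(new Section("By car", "US-101")))
            .add(new Section("Eat", "Burritos"));
        System.out.println(toHtml(find(page, "Get in"), 2));
    }
}
```

With a real parser the tree comes from the library rather than being built by hand, but the select-then-render step has the same shape.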

There is a page on mediawiki.org that lists wikitext parsers in various programming languages. I don't think any of them handle 99%+ of wikitext, though. In general, parsing wikitext is a really complex problem. Wikitext isn't even formally defined anywhere outside of the MediaWiki parser itself.

Christian answered Sep 25 '22 03:09


This question was answered years ago, but I wanted to save future visitors the effort I had to take to figure out how to use Sweble.

You can try the documentation at their site, but I couldn't figure it out. Just look at the example source code. Download the source jar for swc-example-basic at https://repo1.maven.org/maven2/org/sweble/wikitext/swc-example-basic/2.0.0/swc-example-basic-2.0.0-sources.jar and look at App.java and TextConverter.java.

Basically, to parse a page and convert it to another form, you first add the following dependency to your project:

    <dependency>
        <groupId>org.sweble.wikitext</groupId>
        <artifactId>swc-engine</artifactId>
        <version>2.0.0</version>
    </dependency>

Then, do the following:

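    // Imports as used by App.java in the swc-example-basic sources (Sweble 2.0.0)
    import org.sweble.wikitext.engine.EngineException;
    import org.sweble.wikitext.engine.PageId;
    import org.sweble.wikitext.engine.PageTitle;
    import org.sweble.wikitext.engine.WtEngineImpl;
    import org.sweble.wikitext.engine.config.WikiConfig;
    import org.sweble.wikitext.engine.nodes.EngProcessedPage;
    import org.sweble.wikitext.engine.utils.DefaultConfigEnWp;
    import org.sweble.wikitext.parser.parser.LinkTargetException;

    public String convertWikiText(String title, String wikiText, int maxLineLength) throws LinkTargetException, EngineException {
        // Set up a simple wiki configuration
        WikiConfig config = DefaultConfigEnWp.generate();
        // Instantiate the engine that compiles wiki pages
        WtEngineImpl engine = new WtEngineImpl(config);
        // Build a title and ID for the page text being processed (-1 as the revision, as in the example)
        PageTitle pageTitle = PageTitle.make(config, title);
        PageId pageId = new PageId(pageTitle, -1);
        // Parse and post-process the wikitext into an AST
        EngProcessedPage cp = engine.postprocess(pageId, wikiText, null);
        // Walk the AST with the example TextConverter to produce the output
        TextConverter p = new TextConverter(config, maxLineLength);
        return (String) p.go(cp.getPage());
    }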

The TextConverter is a class you'll find in the examples I mentioned above. Customize it to do whatever you want. For example, the following makes sure all bold text is surrounded by "**":

    public void visit(WtBold b)
    {
        write("**");
        iterate(b);
        write("**");
    }

There are a bunch of visit methods on that class for each type of element that you'll encounter.
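
If the "one visit method per element type" idea is new to you, here is a dependency-free sketch of the same pattern that emits HTML instead of "**". Everything in it (`Text`, `Bold`, `Italics`, `Paragraph`, `HtmlConverter`) is a hypothetical stand-in, not a Sweble class; the `write`, `iterate`, and `go` names only mirror the ones used above. It uses classic double dispatch via `accept`, which is an assumption made to keep the sketch self-contained; how Sweble routes nodes to its `visit` methods is not shown here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for AST nodes and a converter; NOT Sweble's classes.
class VisitorSketch {
    interface Node { void accept(Visitor v); }

    interface Visitor {
        void visit(Text t);
        void visit(Bold b);
        void visit(Italics i);
        void visit(Paragraph p);
    }

    // Leaf node holding plain text.
    static final class Text implements Node {
        final String content;
        Text(String content) { this.content = content; }
        public void accept(Visitor v) { v.visit(this); }
    }

    // Base class for nodes that contain other nodes.
    static abstract class Container implements Node {
        final List<Node> children;
        Container(Node... children) { this.children = Arrays.asList(children); }
    }

    static final class Bold extends Container {
        Bold(Node... c) { super(c); }
        public void accept(Visitor v) { v.visit(this); }
    }

    static final class Italics extends Container {
        Italics(Node... c) { super(c); }
        public void accept(Visitor v) { v.visit(this); }
    }

    static final class Paragraph extends Container {
        Paragraph(Node... c) { super(c); }
        public void accept(Visitor v) { v.visit(this); }
    }

    // One visit method per element type decides that element's custom HTML.
    static final class HtmlConverter implements Visitor {
        private final StringBuilder out = new StringBuilder();

        String go(Node root) {
            out.setLength(0);
            root.accept(this);
            return out.toString();
        }

        private void write(String s) { out.append(s); }

        private void iterate(Container c) {
            for (Node child : c.children) {
                child.accept(this);
            }
        }

        public void visit(Text t) { write(escape(t.content)); }
        public void visit(Bold b) { write("<strong>"); iterate(b); write("</strong>"); }
        public void visit(Italics i) { write("<em>"); iterate(i); write("</em>"); }
        public void visit(Paragraph p) { write("<p>"); iterate(p); write("</p>"); }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    public static void main(String[] args) {
        Node page = new Paragraph(
            new Text("Visit "),
            new Bold(new Text("Golden Gate Park")),
            new Text(" & more"));
        System.out.println(new HtmlConverter().go(page));
    }
}
```

Changing the output for one kind of element means editing only that element's `visit` method, which is the same way you would customize the example TextConverter.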

HappyEngineer answered Sep 24 '22 03:09