
Get Text Content from MediaWiki page via API

I'm quite new to MediaWiki, and I have a bit of a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the wiki content of the page (with wiki markup). I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test 

But I need only the textual content, without the Wiki markup. Is that possible with the MediaWiki API?

Le_Coeur asked Oct 26 '09 14:10



2 Answers

Use action=parse to get the HTML:

/api.php?action=parse&page=test

One way to get the text from the HTML is to load it into a browser and walk the DOM with JavaScript, collecting only the text nodes.
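
The same idea also works outside a browser. Below is a minimal Python sketch, not a definitive implementation: it requests the parsed HTML as JSON and keeps only the text nodes. The names `TextCollector`, `html_to_text`, `parse_url`, `text_from_parse_response` and `fetch_page_text` are hypothetical, as are the `prop=text` and `format=json` parameters added to the request and the User-Agent string. It assumes the response carries the HTML under `parse.text`, either as a plain string or wrapped in a `"*"` key.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text nodes of an HTML fragment with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.parts).split())


def parse_url(api_base, page):
    """Build an action=parse request that returns only the page HTML as JSON."""
    params = {"action": "parse", "page": page, "prop": "text", "format": "json"}
    return api_base + "?" + urlencode(params)


def text_from_parse_response(response):
    """Pull the HTML out of a decoded action=parse response and strip the tags."""
    text = response["parse"]["text"]
    # Assumption: the HTML is either a plain string or wrapped as {"*": html}.
    html = text["*"] if isinstance(text, dict) else text
    return html_to_text(html)


def fetch_page_text(api_base, page):
    """Fetch a page over HTTP; api_base is the full URL of the wiki's api.php."""
    # Assumption: some wikis reject requests without a descriptive User-Agent.
    request = Request(parse_url(api_base, page),
                      headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as resp:
        return text_from_parse_response(json.load(resp))
```

Calling `fetch_page_text` with the full URL of a wiki's api.php and the title `test` corresponds to the request shown above. The result still contains text from navigation boxes, tables and edit-section links, so some wikis may need extra filtering.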

gilly3 answered Sep 23 '22 04:09


The TextExtracts extension of the API does about what you're asking. Use prop=extracts to get a cleaned-up response. What's also nice is that it still includes section tags, so you can identify individual sections of the article.

For example, this request returns cleaned-up text for the Stack Overflow article:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true 

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
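
As a rough illustration of how the prop=extracts request could be used from a script, here is a minimal Python sketch. The helper names `extracts_url` and `extracts_from_response` are hypothetical, and it switches to `format=json` for easier decoding. It also adds the TextExtracts `explaintext` parameter, which is not in the request above and asks for plain text instead of limited HTML. The sketch assumes the decoded response lists pages under `query.pages`, either as a dict keyed by page ID or as a list.

```python
from urllib.parse import urlencode


def extracts_url(api_base, title, plain_text=True):
    """Build a prop=extracts query for one title, following redirects."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "redirects": "true",
    }
    if plain_text:
        # Assumption: the wiki has TextExtracts installed, so explaintext is accepted.
        params["explaintext"] = "1"
    return api_base + "?" + urlencode(params)


def extracts_from_response(response):
    """Map each returned page title to its extract (None if no extract is present)."""
    pages = response.get("query", {}).get("pages", {})
    # Assumption: pages is a dict keyed by page ID, or a list of page objects.
    entries = pages.values() if isinstance(pages, dict) else pages
    return {page.get("title"): page.get("extract") for page in entries}
```

A page without an `extract` key (for example a missing title, or a wiki without the extension) maps to `None`, which gives the caller a simple way to fall back to the action=parse approach.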

eric.mitchell answered Sep 20 '22 04:09