
How to get HTML content text of a Wikipedia Page (via Wikipedia API)? [duplicate]

I just want to get the content (no links, no categories, no images... just text).

Leonardo asked May 07 '11 08:05




1 Answer

There is no way to get "just the text" from the Wikipedia API. You can download either the HTML of the page or the wikitext, and you will then have to parse it yourself to remove the bits you don't want to keep. For the HTML, if you go through index.php rather than api.php, use action=render to avoid downloading all the skin content. For the wikitext, use the API or pass action=raw to index.php.
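As a rough illustration of the index.php route, here is a minimal Python sketch. The endpoint URL for English Wikipedia, the helper names index_php_url and fetch, and the User-Agent string are assumptions made for this example; only action=render and action=raw come from the answer itself.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint for English Wikipedia; other wikis use a different host.
INDEX_PHP = "https://en.wikipedia.org/w/index.php"


def index_php_url(title, action):
    """Build an index.php URL for action=render (HTML) or action=raw (wikitext)."""
    if action not in ("render", "raw"):
        raise ValueError("action must be 'render' or 'raw'")
    return INDEX_PHP + "?" + urlencode({"title": title, "action": action})


def fetch(url):
    """Download a URL and return the body as text (hypothetical helper)."""
    # A descriptive User-Agent is good practice; this value is a placeholder.
    req = Request(url, headers={"User-Agent": "wiki-text-example/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # HTML of the article body only, without the surrounding skin.
    print(index_php_url("Albert Einstein", "render"))
    # Raw wikitext of the same page.
    print(index_php_url("Albert Einstein", "raw"))
```

Calling fetch() on either URL returns markup that still needs to be parsed and filtered.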

In the HTML output, MediaWiki is generally good about adding classes to various interface elements you might want to filter out; the templates and such created by users are perhaps less so (e.g. the hack for table sorting just puts some text in a display:none span, no class).
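The filtering described above could be sketched with Python's standard html.parser module. The set of classes to drop (mw-editsection, reference, noprint) is only an assumed starting point, and the class and function names are made up for this example; real pages will need the list adjusted. Elements hidden with an inline display:none style are dropped too, which covers the table-sorting hack mentioned above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumed examples of classes to filter out; adjust for the pages you process.
SKIP_CLASSES = {"mw-editsection", "reference", "noprint"}


class TextExtractor(HTMLParser):
    """Collect text, skipping scripts, styles, hidden and unwanted elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside an element being skipped

    def _should_skip(self, tag, attrs):
        if tag in ("script", "style"):
            return True
        attr = {name: (value or "") for name, value in attrs}
        if SKIP_CLASSES & set(attr.get("class", "").split()):
            return True
        style = attr.get("style", "").replace(" ", "").lower()
        return "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._should_skip(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

This assumes reasonably well-formed HTML; for messy markup a dedicated HTML library would likely be more robust.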

To get the wikitext via the API, use prop=revisions. To get the rendered HTML, use action=parse.
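The two API calls could be built roughly as follows. This is a sketch under assumptions: the English Wikipedia endpoint, the helper names, the User-Agent string, and the extra parameters (rvprop=content, rvslots=main, prop=text, format=json) are choices made for this example rather than part of the answer, and the exact shape of the JSON response depends on those options, so the code stops at returning the decoded JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint for English Wikipedia; other wikis use a different host.
API_PHP = "https://en.wikipedia.org/w/api.php"


def wikitext_url(title):
    """URL for the latest revision's wikitext via prop=revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",   # ask for the revision text itself
        "rvslots": "main",     # main content slot (newer MediaWiki versions)
        "titles": title,
        "format": "json",
    }
    return API_PHP + "?" + urlencode(params)


def parsed_html_url(title):
    """URL for the rendered HTML of a page via action=parse."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",        # only the rendered HTML, not links or categories
        "format": "json",
    }
    return API_PHP + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and decode the JSON body (hypothetical helper)."""
    # The User-Agent value is a placeholder for this example.
    req = Request(url, headers={"User-Agent": "wiki-text-example/0.1"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(wikitext_url("Python"))
    print(parsed_html_url("Python"))
```

Either way, the result is wikitext or HTML that still has to be stripped down to plain text.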

Anomie answered Sep 28 '22 23:09