Get Text Content from mediawiki page via API

Tags: wikipedia-api, mediawiki-api, mediawiki

I'm quite new to MediaWiki, and I have a bit of a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the wiki content of the page (with wiki markup). I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test

But I need only the textual content, without the Wiki markup. Is that possible with the MediaWiki API?

483

asked Oct 26 '09 14:10

Le_Coeur

2 Answers

Use action=parse to get the HTML:

/api.php?action=parse&page=test

One way to get the text from the HTML would be to load it into a browser and use JavaScript to walk the nodes, collecting only the text nodes.
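If a browser isn't handy, the same "keep only the text nodes" idea can be sketched outside of JavaScript. Below is a minimal Python sketch using only the standard library's html.parser. It assumes you have already fetched the HTML string from the action=parse response; the names TextCollector and html_to_text are made up for this illustration, and skipping script/style content is my own choice, not something the API does for you.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text nodes of an HTML fragment with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.parts).split())


if __name__ == "__main__":
    # Hypothetical fragment standing in for the HTML returned by action=parse.
    print(html_to_text("<p>Hello <b>world</b>!</p>"))
```

This collapses all whitespace into single spaces, so paragraph and section boundaries are lost; if you need those, you would have to react to the relevant tags in handle_starttag as well.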

145

answered Sep 23 '22 04:09

gilly3

The TextExtracts extension adds roughly what you're asking for to the API. Use prop=extracts to get a cleaned-up response. For example, the request shown below returns cleaned-up text for the Stack Overflow article. What's also nice is that the result still includes section tags, so you can identify individual sections of the article.

Written out in full, the request looks like:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
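For anyone calling this from a script, here is a rough Python sketch of building that request and pulling the extract out of the reply. It assumes format=json instead of format=xml, and it assumes the usual JSON layout for action=query where pages sit under query.pages keyed by page ID. The function names build_extracts_url and extracts_from_response are made up for this example, and the actual HTTP fetch is left out. The optional explaintext parameter asks TextExtracts for plain text instead of limited HTML; check that the wiki you are targeting has the extension installed.

```python
import json
from urllib.parse import quote, urlencode


def build_extracts_url(api_base, title, plain_text=False):
    """Build a prop=extracts query URL for one page title."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "redirects": "true",
    }
    if plain_text:
        # Ask for plain text rather than limited HTML.
        params["explaintext"] = "1"
    # quote_via=quote encodes spaces as %20, matching the URL in the answer.
    return api_base + "?" + urlencode(params, quote_via=quote)


def extracts_from_response(body):
    """Map page title -> extract from a JSON response body (a string)."""
    data = json.loads(body)
    pages = data.get("query", {}).get("pages", {})
    return {
        page.get("title"): page["extract"]
        for page in pages.values()
        if "extract" in page
    }


if __name__ == "__main__":
    print(build_extracts_url("/api.php", "Stack Overflow"))
```

Pages that are missing, or wikis without TextExtracts, simply produce no "extract" key, so extracts_from_response returns an empty dict in that case instead of raising.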

answered Sep 20 '22 04:09

eric.mitchell
