How to get plain text out of Wikipedia

Tags:

python

wikipedia

wikipedia-api

mediawiki-api

mediawiki

I'd like to write a script that gets the Wikipedia description section only. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?

770

asked Dec 15 '10 16:12

Wifi

2 Answers

Here are a few possible approaches; use whichever works for you. All my code examples below use requests to make HTTP requests to the API; you can install it with `pip install requests`. They also all use the MediaWiki API, and two use its query endpoint; see the MediaWiki documentation for details on either.

1. Get a plain text representation of either the entire page or the page "extract" straight from the API with the `extracts` prop

Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller MediaWiki sites like, say, http://www.wikia.com/.

You want to hit a URL like

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

Breaking that down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):

action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
prop=extracts makes us use the TextExtracts extension
exintro limits the response to content before the first section heading
explaintext makes the extract in the response be plain text instead of HTML

Then parse the JSON response and extract the extract:

    >>> import requests
    >>> response = requests.get(
    ...     'https://en.wikipedia.org/w/api.php',
    ...     params={
    ...         'action': 'query',
    ...         'format': 'json',
    ...         'titles': 'Bla Bla Bla',
    ...         'prop': 'extracts',
    ...         'exintro': True,
    ...         'explaintext': True,
    ...     }
    ... ).json()
    >>> page = next(iter(response['query']['pages'].values()))
    >>> print(page['extract'])
    "Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
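If the requested title does not exist, the API still returns a `pages` entry, but one flagged as missing and without an `extract` key, so `page['extract']` above would raise a `KeyError`. Here is a minimal sketch of a more defensive lookup, which may be useful for a chat bot that receives arbitrary input. The function name `intro_from_response` is hypothetical, and it assumes the API's default JSON layout as used in the example above.

```python
def intro_from_response(response_json):
    """Return the first non-empty extract in a query response, or None.

    Assumes the default JSON layout, where response['query']['pages']
    is a dict keyed by page ID (a missing page is keyed '-1').
    """
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        extract = page.get('extract')
        if extract:
            return extract.strip()
    return None
```

You would call it as `intro_from_response(response)` with the decoded JSON from the request above, and reply with a "page not found" message when it returns `None`.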

2. Get the full HTML of the page using the `parse` endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml) to extract the first paragraph.

For example:

    >>> import requests
    >>> from lxml import html
    >>> response = requests.get(
    ...     'https://en.wikipedia.org/w/api.php',
    ...     params={
    ...         'action': 'parse',
    ...         'page': 'Bla Bla Bla',
    ...         'format': 'json',
    ...     }
    ... ).json()
    >>> raw_html = response['parse']['text']['*']
    >>> document = html.document_fromstring(raw_html)
    >>> first_p = document.xpath('//p')[0]
    >>> intro_text = first_p.text_content()
    >>> print(intro_text)
    "Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
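On some pages the first `<p>` in the rendered HTML can be empty or hold only whitespace, in which case taking `document.xpath('//p')[0]` yields a blank introduction. Here is a rough sketch of picking the first paragraph that actually contains text. Note that it swaps lxml for the standard library's `html.parser`, purely so the sketch has no third-party dependency. The names `first_paragraph` and `_FirstParagraphParser` are hypothetical, and the sketch assumes every `<p>` is explicitly closed.

```python
from html.parser import HTMLParser


class _FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> element that has visible text."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Stop looking at new paragraphs once a non-empty one was found.
        if tag == 'p' and self.result is None:
            self._in_p = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self._in_p = False
            text = ''.join(self._chunks).strip()
            if text:  # skip empty or whitespace-only paragraphs
                self.result = text


def first_paragraph(raw_html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = _FirstParagraphParser()
    parser.feed(raw_html)
    parser.close()
    return parser.result
```

With the `raw_html` string from the example above, `first_paragraph(raw_html)` would stand in for the `xpath` and `text_content()` steps. If you would rather stay with lxml, an XPath such as `//p[normalize-space()]` should achieve a similar filter.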

3. Parse wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.

    >>> import requests
    >>> import mwparserfromhell
    >>> response = requests.get(
    ...     'https://en.wikipedia.org/w/api.php',
    ...     params={
    ...         'action': 'query',
    ...         'format': 'json',
    ...         'titles': 'Bla Bla Bla',
    ...         'prop': 'revisions',
    ...         'rvprop': 'content',
    ...     }
    ... ).json()
    >>> page = next(iter(response['query']['pages'].values()))
    >>> wikicode = page['revisions'][0]['*']
    >>> parsed_wikicode = mwparserfromhell.parse(wikicode)
    >>> print(parsed_wikicode.strip_code())
    {{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

    "Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

    Background and writing He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

    Music video The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

    Chart performance Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

    References

    External links

    Category:1999 singles Category:Gigi D'Agostino songs Category:1999 songs Category:ZYX Music singles Category:Songs written by Gigi D'Agostino

130

answered Oct 02 '22 04:10

Mark Amery

Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.

For instance:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Bla%20Bla%20Bla

means

fetch (action=query) the content (rvprop=content) of the most recent revision of Bla Bla Bla (titles=Bla%20Bla%20Bla) in JSON format (format=json).

You will probably want to search for the query and use the first result, to handle spelling errors and the like.
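One way to do that search step is the API's `list=search` query module, which returns matching page titles for a free-text term. The following is only a sketch under a few assumptions: the helper names (`build_search_params`, `first_search_title`, `find_title`) are hypothetical, the third-party requests library is used for the HTTP call, and the top-ranked hit is assumed to be the page the user meant.

```python
def build_search_params(term):
    """Query parameters for a full-text search returning at most one hit."""
    return {
        'action': 'query',
        'format': 'json',
        'list': 'search',
        'srsearch': term,
        'srlimit': 1,
    }


def first_search_title(response_json):
    """Return the title of the top search hit, or None if nothing matched."""
    hits = response_json.get('query', {}).get('search', [])
    return hits[0]['title'] if hits else None


def find_title(term, api_url='https://en.wikipedia.org/w/api.php'):
    """Search the wiki for `term` and return the best-matching page title."""
    # Third-party dependency (pip install requests); imported here so the
    # two pure helpers above can be used and tested without it.
    import requests

    response = requests.get(api_url, params=build_search_params(term), timeout=10)
    response.raise_for_status()
    return first_search_title(response.json())
```

The title returned by `find_title('bla bla bla')` can then be passed as the `titles` parameter of the content request shown above.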

answered Oct 02 '22 04:10

Katriel
