
Extract the first paragraph from a Wikipedia article (Python)

How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".[3]

Alon Gubkin asked Dec 16 '10 12:12





2 Answers

I wrote a Python library that aims to make this very easy. Check it out on GitHub.

To install it, run

$ pip install wikipedia 

Then to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

As for how it works: wikipedia makes a request to the MobileFrontend extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. Specifically, when passed the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers parse the wikitext and return a plain-text summary of the requested article, up to and including the entire page text. The API also accepts the parameters exchars and exsentences, which limit the number of characters and sentences returned.
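If you would rather not depend on the library, the same kind of request can be made by hand. The following is only a minimal sketch using the standard library, not the code the wikipedia package actually runs. It assumes the English Wikipedia endpoint and the real API parameters exintro (only the text before the first section) and explaintext (plain text instead of HTML). The helper names build_extract_url, parse_extract, first_paragraph and fetch_intro, as well as the User-Agent string, are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; swap the host for another language edition.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=None):
    """Build a prop=extracts query URL for one article title."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1,  # plain text instead of limited HTML
        "redirects": 1,    # follow redirects to the target article
    }
    if sentences is not None:
        params["exsentences"] = sentences  # limit by sentence count
    else:
        params["exintro"] = 1  # only the content before the first section
    return API_URL + "?" + urlencode(params)


def parse_extract(payload):
    """Return the extract text from a decoded API response, or None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def first_paragraph(text):
    """Return the first non-empty line of a plain-text extract."""
    for block in text.split("\n"):
        if block.strip():
            return block.strip()
    return ""


def fetch_intro(title, sentences=None):
    """Fetch the extract over the network (requires internet access)."""
    request = Request(
        build_extract_url(title, sentences),
        headers={"User-Agent": "intro-extract-example/0.1"},  # hypothetical
    )
    with urlopen(request, timeout=10) as response:
        return parse_extract(json.load(response))
```

Calling first_paragraph(fetch_intro("Albert Einstein")) should then give roughly the paragraph asked for in the question, provided the article exists. fetch_intro returns None for a missing page, so check for that before passing the result on.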

goldsmith answered Sep 22 '22 01:09



Some time ago I wrote two classes for getting Wikipedia articles as plain text. I know they aren't the best solution, but you can adapt them to your needs:

    wikipedia.py
    wiki2plain.py

You can use it like this:

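# wikipedia.py and wiki2plain.py are the two files listed above,
# not the pip package of the same name.
from wikipedia import Wikipedia
from wiki2plain import Wiki2Plain

lang = 'simple'
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:
    # Treat any failure to fetch the article as "no content".
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text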
joksnet answered Sep 21 '22 01:09
