
Get first lines of Wikipedia Article

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words, it doesn't matter which) from it.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Look at the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source, which starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to mind was an XPath query, but it would have to be rather complicated to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]

Thanks!

theomega asked Oct 14 '09 10:10



2 Answers

You don't need to.

The API's exintro parameter returns only the first (zeroth) section of the article.

Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein

There are other parameters, too:

  • exchars Length of extracts in characters.
  • exsentences Number of sentences to return.
  • exintro Return only zeroth section.
  • exsectionformat What section heading format to use for plaintext extracts:

    wiki — e.g., == Wikitext ==
    plain — no special decoration
    raw — this extension's internal representation
    
  • exlimit Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
  • explaintext Return plain-text extracts.
  • excontinue When more results are available, use this parameter to continue.

Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
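As a rough illustration of the query above, here is a minimal Python sketch using only the standard library. The endpoint `https://en.wikipedia.org/w/api.php`, the `format=json` parameter, the User-Agent string, and the helper names (`build_extract_url`, `extract_from_response`, `fetch_intro`) are assumptions added for the example and are not part of the answer. It also assumes the default JSON layout, where pages are keyed by page ID under `query.pages`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: English Wikipedia. Other wikis expose their own api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=None):
    """Build a prop=extracts query for the intro of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the zeroth (intro) section
        "explaintext": 1,    # plain text instead of HTML
        "format": "json",    # assumption: ask for JSON output
        "titles": title,
    }
    if sentences is not None:
        params["exsentences"] = sentences
    return API_URL + "?" + urlencode(params)


def extract_from_response(data):
    """Return the first 'extract' found in a decoded response, or None."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_intro(title, sentences=None):
    """Perform the HTTP request (needs network access) and return the intro."""
    request = Request(
        build_extract_url(title, sentences),
        headers={"User-Agent": "intro-fetch-sketch/0.1"},  # hypothetical UA
    )
    with urlopen(request, timeout=10) as response:
        return extract_from_response(json.load(response))
```

Calling `fetch_intro("Albert Einstein", sentences=2)` should return the first two sentences of the intro as plain text, or `None` if the page has no extract.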

octosquidopus answered Oct 18 '22 22:10


I had the same need and wrote some Python code to do that.

The script downloads the Wikipedia article with the given name, parses it using BeautifulSoup, and returns the first few paragraphs.

Code is at http://github.com/anandology/sandbox/blob/master/wikisnip/wikisnip.py.
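The general idea (parse the HTML, skip the infobox, keep the first paragraphs) can be sketched as follows. This is not the linked script: to stay self-contained it swaps BeautifulSoup for the standard library's `html.parser`, and the names `FirstParagraphs` and `first_paragraphs` are made up for the example. It assumes that infoboxes are rendered as `<table>` elements and that the lead text sits in `<p>` elements outside any table.

```python
from html.parser import HTMLParser


class FirstParagraphs(HTMLParser):
    """Collect the text of the first non-empty <p> elements outside tables."""

    def __init__(self, limit=2):
        super().__init__()
        self.limit = limit
        self.paragraphs = []
        self._table_depth = 0   # > 0 while inside an infobox or other table
        self._current = None    # text chunks of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif (tag == "p" and self._table_depth == 0
              and len(self.paragraphs) < self.limit):
            self._current = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._current is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._current).split())
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._table_depth == 0:
            self._current.append(data)


def first_paragraphs(html, limit=2):
    """Return up to `limit` leading paragraphs of an HTML document as text."""
    parser = FirstParagraphs(limit)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Real article HTML has more edge cases than this handles (for example, footnote markers inside paragraphs or hatnotes before the lead), so treat it as a starting point only.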

Anand Chitipothu answered Oct 18 '22 21:10