Get first lines of Wikipedia Article

Tags: parsing, wikipedia, wikipedia-api

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter which) from the article.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Look at the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source, which starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to my mind was to use an XPath query, but such a query would be rather complicated if it had to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]

Thanks!

492

asked Oct 14 '09 10:10

theomega

2 Answers

You don't need to.

The API's exintro parameter returns only the first (zeroth) section of the article.

Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein

There are other parameters, too:

exchars: Length of extracts in characters.
exsentences: Number of sentences to return.
exintro: Return only zeroth section.
exsectionformat: What section heading format to use for plaintext extracts:
    wiki: e.g., == Wikitext ==
    plain: no special decoration
    raw: this extension's internal representation
exlimit: Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
explaintext: Return plain-text extracts.
excontinue: When more results are available, use this parameter to continue.

Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
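
As a rough illustration of the query above, here is a minimal Python sketch that builds the request URL and pulls the extract out of the JSON reply. The endpoint https://en.wikipedia.org/w/api.php, the format=json parameter, the User-Agent string, and the helper names build_extract_url, extract_from_response and fetch_intro are assumptions made for this example; only the ex* parameters come from the list above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the English Wikipedia API. Other wikis use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=None, chars=None):
    """Build a prop=extracts query URL for the intro section of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # return only the zeroth section
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
        "format": "json",   # assumption: ask for a JSON reply
    }
    if sentences is not None:
        params["exsentences"] = sentences
    if chars is not None:
        params["exchars"] = chars
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_from_response(data):
    """Return the extract of the first page found in a decoded JSON reply."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_intro(title, **kwargs):
    """Fetch the intro text over HTTP (requires network access)."""
    request = urllib.request.Request(
        build_extract_url(title, **kwargs),
        headers={"User-Agent": "intro-sketch/0.1 (example)"},  # hypothetical UA
    )
    with urllib.request.urlopen(request) as response:
        return extract_from_response(json.load(response))
```

Calling fetch_intro("Albert Einstein", sentences=2) should then return roughly the first two sentences of the article as plain text, provided the wiki has the extracts module enabled.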

176

answered Oct 18 '22 22:10

octosquidopus

I had the same need and wrote some Python code to do it.

The script downloads the Wikipedia article with the given name, parses it using BeautifulSoup, and returns the first few paragraphs.

Code is at http://github.com/anandology/sandbox/blob/master/wikisnip/wikisnip.py.
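
The general idea (parse the HTML, skip the infobox, keep the first few paragraphs) can be sketched as follows. This is not the linked script: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the names FirstParagraphs and first_paragraphs, as well as the rule "ignore every <p> inside a <table>", are assumptions made for this example.

```python
from html.parser import HTMLParser


class FirstParagraphs(HTMLParser):
    """Collect the text of the first `limit` non-empty <p> elements outside tables."""

    def __init__(self, limit=2):
        super().__init__()
        self.limit = limit
        self.paragraphs = []
        self._table_depth = 0  # > 0 while inside an infobox or any other table
        self._buffer = None    # text chunks of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif (tag == "p" and self._table_depth == 0
                and len(self.paragraphs) < self.limit):
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buffer is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def first_paragraphs(html, limit=2):
    """Return up to `limit` leading paragraphs of an HTML page as plain text."""
    parser = FirstParagraphs(limit)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Passing the downloaded article HTML to first_paragraphs(html, limit=3) would return up to three leading paragraphs. Real article markup has more cases than this sketch covers (hatnotes, coordinates, nested markup), so treat the table-skipping rule as a starting point only.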

answered Oct 18 '22 21:10

Anand Chitipothu
