
Extract the main article text from a Wikipedia page using Python [closed]

I've been searching for hours for a way to extract the main text of a Wikipedia article, without all the links and references. I've tried wikitools, mwlib, BeautifulSoup and more, but I haven't managed to get it working.

Is there an easy and fast way to take the plain text (the actual article) and put it in a Python variable?

SOLUTION: Omid Raha solved it :)

Paolo asked Apr 28 '14 21:04



1 Answer

You can use the wikipedia package, which is a Python wrapper for the Wikipedia (MediaWiki) API.

Here is a quick start.

First install it:

pip install wikipedia

Example:

import wikipedia

# Look up the page by title
p = wikipedia.page("Python programming language")
print(p.url)
print(p.title)
content = p.content  # plain text of the article, held in a Python variable

Output:

http://en.wikipedia.org/wiki/Python_(programming_language)
Python (programming language)
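If you also want to get rid of the section headings and the trailing "References"-style sections, a small post-processing step may help. This is only a sketch: it assumes the text in content marks headings as "== History ==" (with more "=" signs for subsections), and the names clean_content and SKIP_SECTIONS are made up for this example, not part of the wikipedia package.

```python
import re

# Assumption: headings in the plain text look like "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^\s*(={2,})\s*(.+?)\s*\1\s*$")

# Hypothetical list of top-level sections to drop; adjust as needed.
SKIP_SECTIONS = {"See also", "References", "External links", "Further reading", "Notes"}


def clean_content(content):
    """Return the article text without heading lines and without skipped sections."""
    kept = []
    skipping = False
    for line in content.splitlines():
        match = HEADING.match(line)
        if match:
            # Only top-level ("==") headings switch skipping on or off;
            # subsections inherit the state of their parent section.
            if len(match.group(1)) == 2:
                skipping = match.group(2) in SKIP_SECTIONS
            continue  # never keep the heading line itself
        if not skipping and line.strip():
            kept.append(line.strip())
    return "\n\n".join(kept)


# Stand-in for p.content so the example runs without network access.
sample = (
    "Python is a language.\n\n\n"
    "== History ==\nCreated by Guido.\n\n\n"
    "== References ==\nSome ref.\n"
    "=== Citations ===\nAnother ref.\n"
)
print(clean_content(sample))
```

With a real page you would call clean_content(p.content) instead of passing the sample string.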
Omid Raha answered Sep 24 '22 01:09