I've been searching for hours on how to extract the main text of a Wikipedia article, without all the links and references. I've tried wikitools, mwlib, BeautifulSoup and more. But I haven't really managed to.
Is there any easy and fast way for me to take the clear text (the actual article), and put it in a Python variable?
SOLUTION: Omid Raha solved it :)
In order to extract data from Wikipedia, we have to first import the wikipedia library in Python using 'pip install wikipedia'. In this program, we will extract the summary of Python Programming from Wikipedia and print it inside a textbox.
This is a fun gimmick and Wikipedia is pretty lenient when it comes to web scraping. There are also harder to scrape websites such as Amazon or Google. If you want to scrape such a website, you should set up a system with headless Chrome browsers and proxy servers.
Wikipedia is a Python library that makes it easy to access and parse data from Wikipedia. Search Wikipedia, get article summaries, get data like links and images from a page, and more. Wikipedia wraps the MediaWiki API so you can focus on using Wikipedia data, not getting it.
You can use this package, that is a python wrapper for Wikipedia API,
Here is a quick start.
First install it:
pip install wikipedia
Example:
import wikipedia
p = wikipedia.page("Python programming language")
print(p.url)
print(p.title)
content = p.content # Content of page.
Output:
http://en.wikipedia.org/wiki/Python_(programming_language)
Python (programming language)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With