I would like to use only medical data from Wikipedia for analysis. I use Python for scraping, and I have used the wikipedia library to look up a page by a query word:
import wikipedia

wikipedia.set_lang("en")
query = input()
# Look up the page whose title best matches the query.
wiki_page = wikipedia.page(title=query, auto_suggest=True)
# Print every category the page belongs to.
for category in wiki_page.categories:
    print(category)
This gives me the categories of a page.
But my problem is the reverse:
I want to give a category, for example "health" or "medical terminology", and get all articles in that category.
How can I do this?
Much of the structured data shown in Wikipedia infoboxes is also available in machine-readable form from Wikidata. A Wikipedia page has a "Wikidata item" link if an item is available, so for that kind of data there is no need to scrape. More generally, you don't want to scrape Wikipedia with a web crawler in the first place.
You might still want to extract data from Wikipedia into a more convenient format, such as an Excel spreadsheet. This is where web scraping can help: a scraper lets you select the specific data you want from an article and write it to a spreadsheet, without downloading the entire article.
Web crawling is not the only way you can extract and analyze data from Wikipedia. For example, Wikimedia provides regular data dumps in a variety of formats. There is also the Wikimedia API which allows you to not only receive data from different wikis but also create bots and contribute to articles programmatically.
Web scraping is also known as screen scraping, web data extraction, or web harvesting. Python is a popular language for it: the language helps programmers write clear, logical code for small and large projects, and it can handle most crawling-related tasks smoothly.
There is the API:Categorymembers page, which documents usage and parameters and gives examples on "how to retrieve lists of pages in a given category, ordered by title". It won't save you from having to descend through the category tree yourself (see below), but you get a nice entry point and machine-readable results.
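As a rough illustration of the API:Categorymembers approach, here is a minimal sketch that lists the pages directly in one category and follows the continuation token across result batches. The function names (build_params, fetch_json, iter_category_pages) and the User-Agent string are made up for this example; the query parameters (list=categorymembers, cmtitle, cmtype, cmlimit, cmcontinue) are the ones the API page describes. Subcategories are not followed here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(category, cmcontinue=None, limit=500):
    """Build query parameters for one list=categorymembers request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # must include the "Category:" prefix
        "cmtype": "page",      # articles only, no subcategories or files
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return params


def fetch_json(params):
    """Send one GET request to the API and return the parsed JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-demo/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_category_pages(category, fetch=fetch_json):
    """Yield the titles of all pages directly in the given category."""
    cmcontinue = None
    while True:
        data = fetch(build_params(category, cmcontinue))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        # The API returns a "continue" object while more results remain.
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break
```

Calling list(iter_category_pages("Category:Medical terminology")) should then return the article titles in that category. The fetch argument exists only so that the network call can be swapped out, for example in tests.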
A very brief pointer is given on the Help:Category page, section Searching for articles in categories:
In addition to browsing through hierarchies of categories, it is possible to use the search tool to find specific articles in specific categories. To search for articles in a specific category, type incategory:"CategoryName" in the search box.
An "OR" can be added to join the contents of one category with the contents of another. For example, enter
incategory:"Suspension bridges" OR incategory:"Bridges in New York City"
to return all pages that belong to either (or both) of the categories.
Note that using search to find categories will not find articles which have been categorized using templates. This feature also doesn't return pages in subcategories.
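The same incategory: search can probably be issued from a script through the API's list=search module instead of the search box. The sketch below only builds the search string and the request parameters; the helper names are invented for this example, and the limitations noted above (template-assigned categories, no subcategories) would still apply.

```python
def incategory_query(categories):
    """Join category names into an incategory: search string with OR."""
    return " OR ".join('incategory:"{}"'.format(name) for name in categories)


def build_search_params(categories, limit=50):
    """Build parameters for an action=query&list=search API request."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": incategory_query(categories),
        "srlimit": str(limit),
        "format": "json",
    }


query = incategory_query(["Suspension bridges", "Bridges in New York City"])
print(query)
```

The resulting parameters can be URL-encoded and sent to https://en.wikipedia.org/w/api.php; the matching pages are then listed under query.search in the JSON response.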
To address the subcategory problem, the page Special:CategoryTree can be used instead. However, the page does not link to any obvious documentation, so I think the <form> fields must be looked up manually in the page source in order to build a programmatic interface.
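An alternative to reverse-engineering Special:CategoryTree is to descend the tree yourself with repeated categorymembers queries. The following is a minimal sketch under that assumption, not a description of how Special:CategoryTree works: it walks subcategories breadth-first up to a depth limit and remembers visited categories, since the category graph can contain cycles. The names get_members and walk_category, the User-Agent string, and the default depth are all illustrative choices.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"


def get_members(category):
    """Return (page_titles, subcategory_titles) for one category."""
    pages, subcats = [], []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; replace it with one that identifies your tool.
        request = urllib.request.Request(
            url, headers={"User-Agent": "category-walk-demo/0.1 (example)"}
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for member in data["query"]["categorymembers"]:
            # Namespace 14 is the Category namespace.
            if member["ns"] == 14:
                subcats.append(member["title"])
            else:
                pages.append(member["title"])
        continuation = data.get("continue")
        if not continuation:
            break
        params.update(continuation)  # carries cmcontinue into the next request
    return pages, subcats


def walk_category(root, max_depth=2, members=get_members):
    """Collect article titles from root and its subcategories, breadth-first."""
    seen = {root}
    queue = deque([(root, 0)])
    articles = set()
    while queue:
        category, depth = queue.popleft()
        pages, subcats = members(category)
        articles.update(pages)
        if depth < max_depth:
            for sub in subcats:
                if sub not in seen:  # guard against cycles and repeats
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return sorted(articles)
```

A call such as walk_category("Category:Medical terminology", max_depth=1) would gather the articles in that category and in its direct subcategories. Broad categories like health fan out quickly, so a small depth limit is a sensible starting point.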