
How to scrape the data from Wikipedia by category?

I would like to use only medical data from Wikipedia for analysis. I use Python for scraping. I have used this library to look up a page by a query word:

import wikipedia

wikipedia.set_lang("en")
query = input()  # raw_input() in Python 2
wiki_page = wikipedia.page(title=query, auto_suggest=True)
for category in wiki_page.categories:
    print(category)

and get the page's categories.

But my problem is the reverse: I want to give a category, for example "health" or "medical terminology", and get all articles of that type.

How can I do this?

asked Nov 10 '15 by Татьяна Паскевич



1 Answer

Edit: actual answer

There is API:Categorymembers, which documents usage and parameters and gives examples of how to retrieve lists of pages in a given category, ordered by title. It won't save you from having to descend through the category tree (cf. below) yourself, but it gives you a clean entry point and machine-readable results.
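A minimal sketch of that API call, assuming the `requests` library. The category name and the injectable `get` parameter are illustrative choices, not part of the MediaWiki API:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_members(category, cmtype="page", get=requests.get):
    """Yield titles of members of `category`, following API continuation.

    cmtype="page" lists articles, cmtype="subcat" lists subcategories.
    `get` is injectable so the function can be tested without network access.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = get(API_URL, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:  # no more result pages
            break
        params.update(data["continue"])

# Live usage (makes a network call):
# for title in category_members("Category:Medical terminology"):
#     print(title)
```

Descending the tree is then a second loop over `category_members(cat, cmtype="subcat")` for each category you visit; keep a `seen` set to guard against category cycles.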

Old answer: related information

A very brief pointer is given on the Help:Category page, section Searching for articles in categories:

In addition to browsing through hierarchies of categories, it is possible to use the search tool to find specific articles in specific categories. To search for articles in a specific category, type incategory:"CategoryName" in the search box.

An "OR" can be added to join the contents of one category with the contents of another. For example, enter

    incategory:"Suspension bridges" OR incategory:"Bridges in New York City"

to return all pages that belong to either (or both) of the categories, as here.

Note that using search to find categories will not find articles which have been categorized using templates. This feature also doesn't return pages in subcategories.
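The same `incategory:` syntax can also be sent through the MediaWiki search API (`list=search`), so the example above can be run programmatically. The small helper below, which joins category names with OR, is just an illustration:

```python
import requests

def incategory_query(*categories):
    # Build an incategory:"..." OR incategory:"..." search string.
    return " OR ".join('incategory:"{}"'.format(c) for c in categories)

query = incategory_query("Suspension bridges", "Bridges in New York City")

# Live usage (makes a network call):
# resp = requests.get("https://en.wikipedia.org/w/api.php",
#                     params={"action": "query", "list": "search",
#                             "srsearch": query, "format": "json"})
# titles = [hit["title"] for hit in resp.json()["query"]["search"]]
```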

To address the subcategory problem, the page Special:CategoryTree can be used instead. However, that page offers no obvious documentation, so the <form> fields would have to be dug out of the page source by hand to drive it programmatically.

answered Sep 18 '22 by ojdo