
How to scrape the data from Wikipedia by category?

I would like to use only medical data from Wikipedia for analysis. I use Python for scraping. I have used this library to look up a page by a query word:

import wikipedia

wikipedia.set_lang("en")
query = input()  # raw_input() in Python 2
wiki_page = wikipedia.page(title=query, auto_suggest=True)
for category in wiki_page.categories:
    print(category)

and get the page's categories.

But my problem is the reverse: I want to give a category, for example "health" or "medical terminology", and get all articles of that type.

How can I do this?

asked Nov 10 '15 by Татьяна Паскевич



1 Answer

Edit: actual answer

There is API:Categorymembers, which documents usage and parameters and gives examples of how to retrieve lists of pages in a given category, ordered by title. It won't save you from having to descend through the category tree (cf. below) yourself, but it gives you a clean entry point and machine-readable results.
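A minimal sketch of that API call, assuming the `requests` library. The category name and the injectable `get` parameter are illustrative choices, not part of the MediaWiki API:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_members(category, cmtype="page", get=requests.get):
    """Yield titles of members of `category`, following API continuation.

    cmtype="page" lists articles, cmtype="subcat" lists subcategories.
    `get` is injectable so the function can be tested without network access.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = get(API_URL, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:  # no more result pages
            break
        params.update(data["continue"])

# Live usage (makes a network call):
# for title in category_members("Category:Medical terminology"):
#     print(title)
```

Descending the tree is then a second loop over `category_members(cat, cmtype="subcat")` for each category you visit; keep a `seen` set to guard against category cycles.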

Old answer: related information

A very brief pointer is given on the Help:Category page, section Searching for articles in categories:

In addition to browsing through hierarchies of categories, it is possible to use the search tool to find specific articles in specific categories. To search for articles in a specific category, type incategory:"CategoryName" in the search box.

An "OR" can be added to join the contents of one category with the contents of another. For example, enter

    incategory:"Suspension bridges" OR incategory:"Bridges in New York City"

to return all pages that belong to either (or both) of the categories, as here.

Note that using search to find categories will not find articles which have been categorized using templates. This feature also doesn't return pages in subcategories.
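The same `incategory:` syntax can also be sent through the MediaWiki search API (`list=search`), so the example above can be run programmatically. The small helper below, which joins category names with OR, is just an illustration:

```python
import requests

def incategory_query(*categories):
    # Build an incategory:"..." OR incategory:"..." search string.
    return " OR ".join('incategory:"{}"'.format(c) for c in categories)

query = incategory_query("Suspension bridges", "Bridges in New York City")

# Live usage (makes a network call):
# resp = requests.get("https://en.wikipedia.org/w/api.php",
#                     params={"action": "query", "list": "search",
#                             "srsearch": query, "format": "json"})
# titles = [hit["title"] for hit in resp.json()["query"]["search"]]
```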

To address the subcategory problem, the page Special:CategoryTree can be used instead. However, that page offers no obvious documentation, so the <form> fields would have to be dug out of the page source by hand to drive it programmatically.

answered Sep 18 '22 by ojdo