I am looking for a way to parse Wikipedia dumps and retrieve the hyperlinks found in each page. My main objective is to create a directed graph of the possible paths from one Wikipedia page to another.
For example: the page for "Dog" has a link to "Canis lupus", so I would have Dog -> Canis lupus as output.
PS: I would prefer Python libraries if there are any.
The simplest way would be to use the dump that already contains the links between pages: pagelinks.sql. Import it into a MySQL database, which you can then query from any language, including Python. To make sense of the data in that dump, you will also need to import page.sql, which maps page IDs to page titles.
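Once both tables are imported, building the graph is mostly a matter of joining them. Here is a minimal sketch of that step in Python, assuming you have already fetched the rows from MySQL yourself. The function name build_link_graph and the two tuple layouts, (page_id, title) and (from_id, target_title), are assumptions made for illustration; the real tables have more columns (namespaces, for instance), and the exact schema can differ between dump versions, so check it before relying on this.

```python
from collections import defaultdict


def build_link_graph(pages, links):
    """Build a directed graph of page titles.

    pages: iterable of (page_id, title) pairs (assumed layout, e.g. from page.sql)
    links: iterable of (from_id, target_title) pairs (assumed layout, e.g. from pagelinks.sql)
    Returns a dict mapping each source title to the set of titles it links to.
    """
    id_to_title = dict(pages)
    known_titles = set(id_to_title.values())
    graph = defaultdict(set)
    for from_id, target_title in links:
        source = id_to_title.get(from_id)
        # Skip links whose source page is unknown or whose target page does not exist.
        if source is None or target_title not in known_titles:
            continue
        graph[source].add(target_title)
    return graph


# Tiny made-up sample standing in for rows fetched from the database.
pages = [(1, "Dog"), (2, "Canis_lupus"), (3, "Wolf")]
links = [(1, "Canis_lupus"), (2, "Wolf"), (1, "Missing_page")]

graph = build_link_graph(pages, links)
for source, targets in sorted(graph.items()):
    for target in sorted(targets):
        print(f"{source} -> {target}")
```

For the full English Wikipedia the link table is very large, so in practice you may prefer to stream rows from the database and write edges to a file instead of holding everything in memory.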