
How to parse Wikipedia dumps to create a link graph?

I am looking for a way to parse Wikipedia dumps and retrieve the hyperlinks found in each page. My main objective is to create a directed graph of the possible paths from one Wikipedia page to another.

For example: the page for "Dog" has a link to "Canis lupus", so the output would contain the edge Dog -> Canis lupus.

PS: I would prefer Python libraries if there are any.

Pedro asked Nov 01 '22 13:11


1 Answer

The simplest way is to use the dump that already contains the links between pages: pagelinks.sql. Import it into a MySQL database, and you can then query that database from any language, including Python. To make sense of the data in that dump, you will also need to import page.sql, which maps page IDs to page titles.
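
Once both tables are imported, building the graph is mostly a join between them. Here is a minimal Python sketch of that step, assuming the older pagelinks layout with columns (pl_from, pl_namespace, pl_title) and page rows of the form (page_id, page_namespace, page_title); newer dumps may lay out pagelinks differently, so check the schema of the files you download. The function name `build_link_graph` and the sample rows are made up for illustration. In practice the rows would come from a MySQL cursor. Note that titles in the dumps use underscores instead of spaces.

```python
from collections import defaultdict


def build_link_graph(page_rows, pagelink_rows, namespace=0):
    """Build {source_title: {target_title, ...}} for one namespace.

    page_rows:     iterable of (page_id, page_namespace, page_title)
    pagelink_rows: iterable of (pl_from, pl_namespace, pl_title)
    (Assumed column layout; verify against your dump's schema.)
    """
    # Map page IDs to titles, keeping only pages in the chosen namespace
    # (namespace 0 holds the articles).
    id_to_title = {
        page_id: title
        for page_id, ns, title in page_rows
        if ns == namespace
    }
    known_titles = set(id_to_title.values())

    graph = defaultdict(set)
    for pl_from, pl_ns, pl_title in pagelink_rows:
        source = id_to_title.get(pl_from)
        if source is None or pl_ns != namespace:
            continue  # source or target lies outside the namespace
        if pl_title not in known_titles:
            continue  # link to a page that does not exist
        graph[source].add(pl_title)
    return dict(graph)


# Made-up sample rows standing in for query results.
sample_pages = [(1, 0, "Dog"), (2, 0, "Canis_lupus"), (3, 1, "Dog")]
sample_links = [
    (1, 0, "Canis_lupus"),
    (1, 0, "Missing_page"),
    (3, 0, "Dog"),
    (2, 0, "Dog"),
]

link_graph = build_link_graph(sample_pages, sample_links)
for source, targets in sorted(link_graph.items()):
    for target in sorted(targets):
        print(f"{source} -> {target}")
```

For the full English Wikipedia the link table is very large, so an in-memory dictionary of sets may not fit; doing the join in SQL and streaming the resulting edges to a file is a reasonable alternative.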

svick answered Dec 14 '22 10:12