
Mining Wikipedia for mapping relations for text mining

I am planning to develop a web-based application that crawls Wikipedia to find relations and stores them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, extracting various pieces of information from it, and storing them in a database. That information might include his date of birth, his company, and a few other things. I need to know whether there is a way to identify these specific facts on the page so that I can store them. Pointers to specific books or algorithms would be greatly appreciated, as would mentions of good open-source libraries.

Thank You

jvc asked Dec 11 '25 09:12



1 Answer

If you haven't already, you should have a look at DBpedia. Many categories of Wikipedia articles have "Infoboxes" that hold the kinds of information you describe, and the DBpedia project has built a database out of them:

http://en.wikipedia.org/wiki/DBpedia
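
As a rough illustration of what querying DBpedia could look like, here is a minimal Python sketch that asks for every property/value pair recorded for a page title. It assumes the public SPARQL endpoint at https://dbpedia.org/sparql is reachable and accepts the `query` and `format` parameters, and that the resource URI follows the `http://dbpedia.org/resource/<Title>` pattern. The function names are made up for this example, so treat it as a starting point rather than a definitive client.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint (may be rate-limited or unavailable).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def resource_uri(name: str) -> str:
    """Map a page title such as 'Bill Gates' to a DBpedia resource URI."""
    return "http://dbpedia.org/resource/" + urllib.parse.quote(name.replace(" ", "_"))


def build_query(name: str) -> str:
    """Build a SPARQL query for the property/value pairs attached to the resource."""
    return (
        "SELECT ?property ?value WHERE { <%s> ?property ?value } LIMIT 200"
        % resource_uri(name)
    )


def parse_bindings(payload: dict) -> list:
    """Flatten a SPARQL JSON result document into (property, value) tuples."""
    rows = payload["results"]["bindings"]
    return [(row["property"]["value"], row["value"]["value"]) for row in rows]


def fetch_facts(name: str) -> list:
    """Send the query to the endpoint and return the parsed rows (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(name), "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(DBPEDIA_ENDPOINT + "?" + params, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_facts("Bill Gates")` should return a list of tuples that you can filter for the properties you care about (a birth date, for instance) and insert into your own database. Which property names are present varies from article to article, so inspect the results before hard-coding any of them.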

You might also leverage some of the information in Metaweb's Freebase (which overlaps with DBpedia and, I believe, may even integrate its information). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.

UPDATE: Freebase is no more; Metaweb was acquired by Google, and Freebase was eventually folded into the Google Knowledge Graph. There is an API, but I don't think it has anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/

As for the natural language processing bit: if you do make headway on that problem, you might consider these databases as repositories for any information you mine.

HostileFork says dont trust SE answered Dec 12 '25 22:12



