Mining Wikipedia for mapping relations for text mining

Question

I am planning to develop a web-based application that crawls Wikipedia to find relations and stores them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, extracting various pieces of information from it, and storing them in a database. The information may include his date of birth, his company, and a few other things. Is there a way to locate these specific pieces of data on the page so that I can store them in a database? Pointers to specific books or algorithms would be greatly appreciated, as would recommendations for good open-source libraries.

Thank You

HostileFork says dont trust SE · Accepted Answer

If you haven't already, you should have a look at DBpedia. Many categories of Wikipedia articles have "Infoboxes" holding the kinds of information you describe, and the DBpedia project has built a database out of them:

http://en.wikipedia.org/wiki/DBpedia
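
To make that concrete, here is a minimal sketch of asking DBpedia's public SPARQL endpoint for a birth date. It assumes the endpoint lives at https://dbpedia.org/sparql, accepts a `format` parameter requesting JSON results, and exposes the `dbo:birthDate` property for the resource; the helper names (`build_query`, `build_url`, `extract_values`, `fetch`) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: DBpedia's public SPARQL endpoint.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def build_query(title):
    """Build a SPARQL query for the birth date of a DBpedia resource.

    `title` is the Wikipedia page title; spaces become underscores,
    which is how DBpedia names its resources.
    """
    name = title.replace(" ", "_")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT ?birthDate WHERE {{ <http://dbpedia.org/resource/{name}> "
        "dbo:birthDate ?birthDate }"
    )


def build_url(query):
    """Encode the query into a GET URL that asks for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{DBPEDIA_SPARQL}?{params}"


def extract_values(results, var):
    """Pull the plain values of one variable out of a SPARQL JSON result."""
    return [
        binding[var]["value"]
        for binding in results["results"]["bindings"]
        if var in binding
    ]


def fetch(query, timeout=10):
    """Run the query over HTTP (needs network access)."""
    with urllib.request.urlopen(build_url(query), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    results = fetch(build_query("Bill Gates"))
    print(extract_values(results, "birthDate"))
```

Other properties (employer, spouse, and so on) can be queried the same way, though which ones exist for a given page depends on what its infobox contained.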

You might also leverage some of the information in Metaweb's Freebase (which overlaps with DBpedia and, I believe, may even integrate its data). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.

UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think it has anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/

As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
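
If you do end up mining pages yourself, the infobox fields in the raw wikitext are probably the easiest place to start. The following is only a rough sketch: the sample wikitext is hand-written for illustration (not copied from Wikipedia), the function names are hypothetical, and the regexes handle only flat `| key = value` lines and the simplest form of the birth date template. A dedicated wikitext parser such as mwparserfromhell would likely be more robust for real pages.

```python
import re
import sqlite3

# Hypothetical, hand-written wikitext fragment used only for illustration.
SAMPLE_WIKITEXT = """{{Infobox person
| name        = Example Person
| birth_date  = {{birth date|1950|1|2}}
| occupation  = Engineer
}}"""

# One "| key = value" pair per line; nested multi-line templates are not handled.
FIELD_RE = re.compile(r"^\|[ \t]*([\w ]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

# Simplest form of the birth date template: {{birth date|YYYY|M|D}}.
BIRTH_RE = re.compile(
    r"\{\{\s*birth date(?: and age)?\s*\|(\d{4})\|(\d{1,2})\|(\d{1,2})",
    re.IGNORECASE,
)


def parse_infobox_fields(wikitext):
    """Return a dict of infobox field names to their raw wikitext values."""
    return dict(FIELD_RE.findall(wikitext))


def normalize_birth_date(raw):
    """Turn a birth date template into an ISO date string, or None."""
    match = BIRTH_RE.search(raw)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def store_facts(conn, subject, fields):
    """Store each field as a (subject, field, value) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (subject TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?)",
        [(subject, key, value) for key, value in fields.items()],
    )
    conn.commit()


if __name__ == "__main__":
    fields = parse_infobox_fields(SAMPLE_WIKITEXT)
    fields["birth_date"] = normalize_birth_date(fields["birth_date"])
    conn = sqlite3.connect(":memory:")
    store_facts(conn, fields["name"], fields)
    for row in conn.execute("SELECT * FROM facts"):
        print(row)
```

The single three-column table keeps the schema open-ended, since different kinds of pages carry different infobox fields.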

Tags: python, pattern-matching, text-mining, data-mining, wikipedia

jvc
