Getting Wikipedia infoboxes in a format that Ruby can understand

Question

I am trying to get the data from Wikipedia's infoboxes into a hash or something so that I can use it in my Ruby on Rails program. Specifically I'm interested in the Infobox company and Infobox person. The example I have been using is "Ford Motor Company". I want to get the company info for that and the person info for the people linked to in Ford's company box.

I've tried figuring out how to do this with the Wikipedia API or DBpedia, but I haven't had much luck. I know Wikipedia can return some things as JSON, which I could parse with Ruby, but I haven't been able to figure out how to get the infobox. In the case of DBpedia, I am kind of lost on how to even query it to get the info for Ford Motor Company.

jimkont · Accepted Answer

I vote for DBpedia.

A simple explanation is:

The DBpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (the unique identifier), with spaces in the article name replaced by _.
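
As a rough illustration of that naming rule, here is a small Ruby sketch. The method name `dbpedia_resource_uri` is made up for this example, and it only handles the space-to-underscore step described above; titles containing other special characters may need additional escaping.

```ruby
# Hypothetical helper (not part of any library): build the DBpedia
# resource URI for a Wikipedia article title by replacing spaces with "_".
DBPEDIA_RESOURCE_BASE = 'http://dbpedia.org/resource/'

def dbpedia_resource_uri(article_title)
  DBPEDIA_RESOURCE_BASE + article_title.strip.tr(' ', '_')
end

puts dbpedia_resource_uri('Ford Motor Company')
# prints http://dbpedia.org/resource/Ford_Motor_Company
```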

http://dbpedia.org/page/ArticleName is the HTML preview, and http://dbpedia.org/data/ArticleName.json or http://dbpedia.org/data/ArticleName.jsod gives a JSON representation of the information about the article you want. (.rdf etc. might be confusing for you right now.)

For Ford Motor Company you should ask for:

http://dbpedia.org/data/Ford_Motor_Company.json

or:

http://dbpedia.org/data/Ford_Motor_Company.jsod

(Whichever is simpler for you)
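
A minimal Ruby sketch of fetching and reading the .json URL might look like the following. The method names are invented for this example, and the sample body is hand-written rather than a captured response. It assumes the document is a JSON object keyed by resource URI, where each property URI maps to an array of objects with "type" and "value" keys; check a real response before relying on that shape.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: download the JSON document for one article.
# It is only defined here, not called, so loading this file needs no network.
# Net::HTTP.get_response does not follow redirects.
def fetch_dbpedia_json(article_name)
  response = Net::HTTP.get_response(URI("http://dbpedia.org/data/#{article_name}.json"))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Turn a response body into { property_uri => [values] } for one resource.
# The nested shape is an assumption, as described in the text above.
def properties_for(json_body, resource_uri)
  data = JSON.parse(json_body)
  data.fetch(resource_uri, {}).transform_values do |objects|
    objects.map { |object| object['value'] }
  end
end

# Hand-written sample in the assumed shape, not real DBpedia output.
SAMPLE_BODY = <<~BODY
  {
    "http://dbpedia.org/resource/Ford_Motor_Company": {
      "http://dbpedia.org/ontology/foundedBy": [
        { "type": "uri", "value": "http://dbpedia.org/resource/Henry_Ford" }
      ]
    }
  }
BODY

ford = properties_for(SAMPLE_BODY, 'http://dbpedia.org/resource/Ford_Motor_Company')
puts ford['http://dbpedia.org/ontology/foundedBy'].inspect
```

With network access, `properties_for(fetch_dbpedia_json('Ford_Motor_Company'), 'http://dbpedia.org/resource/Ford_Motor_Company')` would be the equivalent live call.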

Now, depending on the article type (person or company), different properties describe the resource; these are defined by the DBpedia ontology (http://wiki.dbpedia.org/Ontology).
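
To pick the ontology properties out of a resource and find the other resources it links to (for example, the people in a company's data), something like this sketch could work. The method names and sample data are hypothetical. It assumes ontology property URIs start with http://dbpedia.org/ontology/ and that each value is a hash with "type" and "value" keys.

```ruby
ONTOLOGY_PREFIX = 'http://dbpedia.org/ontology/'
RESOURCE_PREFIX = 'http://dbpedia.org/resource/'

# Keep only ontology properties and key them by their short name.
def ontology_properties(props)
  props.select { |uri, _objects| uri.start_with?(ONTOLOGY_PREFIX) }
       .transform_keys { |uri| uri.delete_prefix(ONTOLOGY_PREFIX) }
end

# Collect the URIs of other DBpedia resources that the properties point at.
def linked_resources(props)
  props.values.flatten
       .select { |object| object['type'] == 'uri' && object['value'].start_with?(RESOURCE_PREFIX) }
       .map { |object| object['value'] }
       .uniq
end

# Hand-written sample in the assumed shape, not real DBpedia output.
SAMPLE_PROPS = {
  'http://dbpedia.org/ontology/foundedBy' => [
    { 'type' => 'uri', 'value' => 'http://dbpedia.org/resource/Henry_Ford' }
  ],
  'http://dbpedia.org/property/name' => [
    { 'type' => 'literal', 'value' => 'Ford Motor Company' }
  ]
}.freeze

puts ontology_properties(SAMPLE_PROPS).keys.inspect
puts linked_resources(SAMPLE_PROPS).inspect
```

Each linked resource URI can then be looked up the same way as the company to get the person data.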

A more advanced step could be to use SPARQL queries to get your data.
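
As a sketch of the SPARQL route, the Ruby below builds a query for every property of one resource and reads a result document. The endpoint http://dbpedia.org/sparql and the `format` request parameter are assumptions about the DBpedia service, and the method names are invented for this example. The result parsing assumes the standard SPARQL JSON results layout (`results` then `bindings`), and the sample body is hand-written.

```ruby
require 'json'
require 'uri'

SPARQL_ENDPOINT = 'http://dbpedia.org/sparql' # assumed endpoint URL

# Build a query that lists every property/value pair of one resource.
def sparql_query_for(resource_uri)
  <<~SPARQL
    SELECT ?property ?value WHERE {
      <#{resource_uri}> ?property ?value .
    }
  SPARQL
end

# Build the GET URI; "format" is an assumed parameter asking for JSON results.
def sparql_request_uri(query)
  uri = URI(SPARQL_ENDPOINT)
  uri.query = URI.encode_www_form(query: query, format: 'application/sparql-results+json')
  uri
end

# Convert a SPARQL JSON results body into [property, value] pairs.
def bindings_to_pairs(json_body)
  bindings = JSON.parse(json_body).dig('results', 'bindings') || []
  bindings.map { |row| [row.dig('property', 'value'), row.dig('value', 'value')] }
end

# Hand-written sample in the SPARQL JSON results layout, not real output.
SAMPLE_RESULTS = <<~BODY
  {
    "head": { "vars": ["property", "value"] },
    "results": { "bindings": [
      { "property": { "type": "uri", "value": "http://dbpedia.org/ontology/foundedBy" },
        "value": { "type": "uri", "value": "http://dbpedia.org/resource/Henry_Ford" } }
    ] }
  }
BODY

puts bindings_to_pairs(SAMPLE_RESULTS).inspect
```

The URI from `sparql_request_uri(sparql_query_for('http://dbpedia.org/resource/Ford_Motor_Company'))` can be passed to `Net::HTTP.get` to run the query for real.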

BeepDog · Answer

Don't try to parse HTML with regular expressions.

See: RegEx match open tags except XHTML self-contained tags

Use XPath or something similar.
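
To give an idea of the XPath approach, here is a sketch using REXML, which ships with Ruby. REXML only accepts well-formed markup, so real Wikipedia pages would likely need an HTML parser (Nokogiri is a common choice). The sample fragment and the method names are made up for this example, and the row layout (one th and one td per row inside a table with class "infobox") is an assumption about how infoboxes are rendered.

```ruby
require 'rexml/document'

# Hand-written, well-formed fragment; not copied from a real page.
SAMPLE_HTML = <<~HTML
  <html><body>
    <table class="infobox vcard">
      <tr><th>Founder</th><td><a href="/wiki/Henry_Ford">Henry Ford</a></td></tr>
      <tr><th>Headquarters</th><td>Dearborn, Michigan</td></tr>
    </table>
  </body></html>
HTML

# Concatenate all text nested under a node, including text inside links.
def inner_text(node)
  node.children.map do |child|
    if child.is_a?(REXML::Text)
      child.value
    elsif child.respond_to?(:children)
      inner_text(child)
    else
      ''
    end
  end.join
end

# Find the first table whose class list contains "infobox" and turn each
# row that has both a header cell and a data cell into a hash entry.
def infobox_to_hash(xhtml)
  doc = REXML::Document.new(xhtml)
  table = REXML::XPath.match(doc, '//table').find do |candidate|
    candidate.attributes['class'].to_s.split.include?('infobox')
  end
  return {} unless table

  REXML::XPath.match(table, './/tr').each_with_object({}) do |row, result|
    header = row.elements['th']
    cell = row.elements['td']
    next unless header && cell

    result[inner_text(header).strip] = inner_text(cell).strip
  end
end

puts infobox_to_hash(SAMPLE_HTML).inspect
```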

Tags:

ruby

web-scraping

wikipedia

mediawiki-api

dbpedia

hadees

2 Answers

jimkont

BeepDog
