
Getting Wikipedia infoboxes in a format that Ruby can understand

I am trying to get the data from Wikipedia's infoboxes into a hash or something so that I can use it in my Ruby on Rails program. Specifically I'm interested in the Infobox company and Infobox person. The example I have been using is "Ford Motor Company". I want to get the company info for that and the person info for the people linked to in Ford's company box.

I've tried figuring out how to do this with the Wikipedia API or DBpedia, but I haven't had much luck. I know Wikipedia can return some things as JSON, which I could parse with Ruby, but I haven't been able to figure out how to get the infobox. In the case of DBpedia, I am kind of lost on how to even query it to get the info for Ford Motor Company.

hadees Avatar asked Dec 28 '22 03:12



2 Answers

I vote for DBpedia.

A simple explanation is:

The DBpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (the unique identifier), with spaces in the article name replaced by _.

http://dbpedia.org/page/ArticleName is the HTML preview, and http://dbpedia.org/data/ArticleName.json (or .jsod) is the JSON representation of the information about the article you want. (.rdf etc. might be confusing for you right now.)

For Ford Motor Company you should ask for:

http://dbpedia.org/data/Ford_Motor_Company.json

or:

http://dbpedia.org/data/Ford_Motor_Company.jsod

(Whichever is simpler for you)
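
For what it's worth, here is a minimal Ruby sketch of the steps above. It assumes the .json document is keyed by resource URI, then by property URI, with each property holding a list of objects that carry a "value" field; check that against what the server actually returns. The helper names (dbpedia_name, dbpedia_json_url, properties_for, fetch_properties) are made up for illustration, and the plain Net::HTTP.get call does not follow redirects.

```ruby
require "json"
require "net/http"
require "uri"

DBPEDIA_DATA = "http://dbpedia.org/data/"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Wikipedia article title -> DBpedia article name (spaces become underscores).
def dbpedia_name(title)
  title.strip.tr(" ", "_")
end

# URL of the JSON representation for an article.
def dbpedia_json_url(title)
  "#{DBPEDIA_DATA}#{dbpedia_name(title)}.json"
end

# Flatten the parsed document into { property_uri => [values] } for one
# article. Assumes the layout described above; returns {} if the resource
# is not present in the document.
def properties_for(doc, title)
  subject = "#{DBPEDIA_RESOURCE}#{dbpedia_name(title)}"
  (doc[subject] || {}).transform_values do |objects|
    objects.map { |object| object["value"] }
  end
end

# Performs the HTTP request; only call this when you have network access.
def fetch_properties(title)
  body = Net::HTTP.get(URI(dbpedia_json_url(title)))
  properties_for(JSON.parse(body), title)
end
```

Calling fetch_properties("Ford Motor Company") should then give you a plain Ruby hash to pick properties out of. Values that are themselves http://dbpedia.org/resource/... URIs (the linked people, for example) can be looked up the same way.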

Now, depending on the article type (person or company), different properties describe it. These properties are defined by the DBpedia ontology (http://wiki.dbpedia.org/Ontology).

A more advanced step could be to use SPARQL queries to get your data.
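
As a rough sketch of the SPARQL route in Ruby: this assumes the public endpoint at http://dbpedia.org/sparql accepts query and format parameters and answers in the standard SPARQL JSON results layout (results, then bindings). The helper names (company_query, sparql_uri, rows, run_query) are hypothetical, and the article name is interpolated into the query as-is, so only pass names you trust.

```ruby
require "json"
require "net/http"
require "uri"

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

# Build a query listing every property/value pair of one resource.
# article_name is the DBpedia name, e.g. "Ford_Motor_Company".
def company_query(article_name)
  <<~SPARQL
    SELECT ?property ?value WHERE {
      <http://dbpedia.org/resource/#{article_name}> ?property ?value .
    }
  SPARQL
end

# Endpoint URI with the query and the requested result format encoded.
def sparql_uri(query)
  uri = URI(SPARQL_ENDPOINT)
  uri.query = URI.encode_www_form(
    query: query,
    format: "application/sparql-results+json"
  )
  uri
end

# Turn parsed SPARQL JSON results into [property, value] pairs.
def rows(results)
  results.fetch("results").fetch("bindings").map do |binding|
    [binding["property"]["value"], binding["value"]["value"]]
  end
end

# Performs the HTTP request; only call this when you have network access.
def run_query(article_name)
  body = Net::HTTP.get(sparql_uri(company_query(article_name)))
  rows(JSON.parse(body))
end
```

Once this works, you can narrow the query to just the ontology properties you care about instead of pulling everything.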

jimkont Avatar answered Jan 13 '23 12:01



Don't try to parse HTML with RegExp.

See: RegEx match open tags except XHTML self-contained tags

Use XPath or something similar.

BeepDog Avatar answered Jan 13 '23 14:01
