How to use DBPedia to extract Tags/Keywords from content?

I am exploring how I can use Wikipedia's taxonomy information to extract Tags/Keywords from my content.

I found articles about DBpedia, a community effort to extract structured information from Wikipedia and make it available on the Web.

Has anyone used its web services? Do you know how they work and how reliable they are?

Pritam Raut asked Jan 20 '11 13:01


2 Answers

DBpedia is a fantastic, high-quality resource. To turn your content into a set of relevant DBpedia concepts, however, you need to identify those concepts accurately in your text, which involves at least two steps:

  1. Identify DBpedia concepts in your content: This includes recognizing concept names (and alternate names) in text, and also disambiguating among all possible meanings of each phrase. The term "Sun" may refer to dozens of possible concepts according to its disambiguation page including a star, newspapers, person names, etc. This involves entity identification, classification, and linking.

  2. Identify which of those concepts are interesting: For example, do you want the concept "Definite article" showing up whenever the text includes the term "the" (the page "The" redirects to it)?
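
The two steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any of the tools mentioned here work internally: the candidate table, the context words, and the list of uninteresting concepts are all hypothetical and hand-written, whereas a real system would derive them from DBpedia labels, redirects, and disambiguation pages.

```python
import re

# Hypothetical toy gazetteer: surface form -> candidate DBpedia resources.
# The "dbr:" prefix stands for http://dbpedia.org/resource/.
CANDIDATES = {
    "sun": ["dbr:Sun", "dbr:The_Sun_(United_Kingdom)", "dbr:Sun_Microsystems"],
    "the": ["dbr:Definite_article"],
    "java": ["dbr:Java_(programming_language)", "dbr:Java"],
}

# Hypothetical context words that hint at each candidate meaning.
CONTEXT = {
    "dbr:Sun": {"star", "solar", "planet", "orbit"},
    "dbr:The_Sun_(United_Kingdom)": {"newspaper", "tabloid", "headline"},
    "dbr:Sun_Microsystems": {"java", "server", "software", "company"},
    "dbr:Java_(programming_language)": {"software", "code", "sun", "compiler"},
    "dbr:Java": {"island", "indonesia", "jakarta"},
    "dbr:Definite_article": {"grammar", "article"},
}

# Step 2 input: concepts we never want as tags (hand-picked for this sketch).
UNINTERESTING = {"dbr:Definite_article"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_concepts(text):
    """Step 1: spot known names and pick the candidate whose context
    words overlap most with the words of the text."""
    tokens = tokenize(text)
    context = set(tokens)
    linked = []
    for token in tokens:
        candidates = CANDIDATES.get(token)
        if not candidates:
            continue
        best = max(candidates, key=lambda c: len(CONTEXT.get(c, set()) & context))
        if best not in linked:
            linked.append(best)
    return linked


def extract_tags(text):
    """Step 2: keep only the linked concepts that are worth tagging."""
    return [c for c in link_concepts(text) if c not in UNINTERESTING]
```

Word overlap is a very crude disambiguation signal; it is used here only to show where disambiguation and filtering sit in the pipeline.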

You may want to consider an existing text analytics library or service that supports entity linking to DBpedia. One great tool for topic indexing is Maui, which was developed by Alyona Medelyan during her PhD at the University of Waikato. Another great open source solution is Wikipedia Miner by David Milne at the same university.

Two commercial services that provide linking to DBpedia concepts are Zemanta and Extractiv (both allow some level of free use). DBpedia Spotlight is another option. Others that may provide these capabilities are listed at: https://stackoverflow.com/questions/2119279/is-there-a-better-tool-than-opencalais
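
For DBpedia Spotlight, a minimal sketch of calling its annotation web service from Python might look like the following. The endpoint URL, the `text` and `confidence` parameters, and the `Resources`, `@URI`, `@surfaceForm`, and `@similarityScore` field names are assumptions based on the public demo service; check the current Spotlight documentation before relying on them. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public demo service; it may move or be rate limited.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking for annotations as JSON."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def parse_resources(payload, min_score=0.0):
    """Turn a decoded JSON response into (surface form, URI, score) tuples.

    The field names are assumptions about the response format."""
    tags = []
    for resource in payload.get("Resources", []):
        score = float(resource.get("@similarityScore", 0.0))
        if score >= min_score:
            tags.append((resource["@surfaceForm"], resource["@URI"], score))
    return tags


def annotate(text, confidence=0.5):
    """Send the text to the service and return the parsed annotations."""
    request = build_request(text, confidence)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_resources(json.load(response))
```

Calling `annotate("Berlin is the capital of Germany.")` needs network access; raising `confidence` or `min_score` trades recall for precision, which is one way to address the "which concepts are interesting" problem.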

Disclosure: I [used to] work at Extractiv (defunct), which is powered by Language Computer Corporation's NLP.

John Lehmann answered Oct 07 '22 01:10


You can use Apache Stanbol for this. Its Entityhub component lets you build custom DBpedia indexes tailored to your needs. You can then use its Enhancer component to extract entities such as places and persons from your text.
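
As a rough sketch of what talking to the Enhancer looks like, the snippet below posts plain text to a Stanbol instance over HTTP. It assumes a Stanbol launcher running locally with its Enhancer reachable at `http://localhost:8080/enhancer` and able to return JSON; both the URL and the function names are assumptions for illustration, so adjust them to your installation.

```python
import urllib.request

# Assumption: a Stanbol launcher running locally on port 8080.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, base_url=STANBOL_ENHANCER_URL):
    """Build a POST request that sends plain text to the Enhancer."""
    return urllib.request.Request(
        base_url,
        data=text.encode("utf-8"),  # supplying data makes this a POST
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/json",
        },
    )


def enhance(text, base_url=STANBOL_ENHANCER_URL):
    """Send the text and return the raw response body as a string.

    The response describes the extracted entities as RDF; parsing it
    is left out of this sketch."""
    request = build_enhancer_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

`enhance("Paris is the capital of France.")` only works against a running Stanbol server; the returned enhancements still need to be filtered down to the entity types you want as tags.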

The following mailing list thread may be helpful:
http://markmail.org/message/52266yl5ohijxiof

You can access running demos of Apache Stanbol from the following link:
http://dev.iks-project.eu/

You can also send further questions to stanbol-dev AT incubator.apache.org.

suat answered Oct 07 '22 02:10