
How can I programmatically generate relevant tags for a database of URLs?

I'm writing an RSS reader in Python as a learning exercise, and I would really like to be able to tag individual entries with keywords for searching. Unfortunately, most real-world feeds don't include keyword metadata. I currently have about 60,000 entries in my test database from about 600 feeds, so manual tagging is not practical. So far I have only been able to find two solutions:

1: Use the Natural Language Toolkit (NLTK) to extract keywords:

  • Pros: flexible; no dependencies on external services.
  • Cons: can only index the article summary, not the full article; non-trivial, since writing a high-quality keyword extraction tool is a project in itself.

2: Use the Google AdWords API to fetch keyword suggestions from the article URL:

  • Pros: very high-quality keywords; based on the entire article text; easy to use.
  • Cons: not free(?); query rate limits unknown; I'm terrified of getting my account banned and not being able to run AdWords campaigns for my commercial sites.

Can anyone offer any suggestions? Are my fears about getting my AdWords account banned unfounded?

Parker Ault asked Oct 25 '22 22:10


1 Answer

Depending on your specific needs, there are a number of free and commercial text annotation tools and services you might consider. They are listed under this question:

Is there a better tool than OpenCalais?

A number of these provide entities, some provide a measure of keyword relevance, and others provide topic tags.
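For a local baseline to compare those services against, the keyword-relevance idea can be roughly sketched with plain TF-IDF over the entry summaries. This is only an illustration under stated assumptions: it uses the Python standard library rather than NLTK or any of the listed services, and the `extract_tags` and `tokenize` helpers and the tiny stopword list are hypothetical names made up for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to with".split()
)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens longer than two
    characters that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def extract_tags(summaries, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each summary."""
    docs = [tokenize(s) for s in summaries]
    n_docs = len(docs)
    # Document frequency: the number of summaries each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    tags = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency times inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:top_n])
    return tags
```

Calling `extract_tags(summaries, top_n=5)` on a list of entry summaries returns one list of candidate tags per entry. The scores are relative to the rest of the corpus, so terms that appear in every summary score zero and the results tend to improve as more entries are included.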

John Lehmann answered Dec 02 '22 19:12