How to auto-tag content, algorithms and suggestions needed

Tags: nlp, tagging

I am working with some really large databases of newspaper articles. I have them in a MySQL database, and I can query them all.

I am now searching for ways to help me tag these articles with somewhat descriptive tags.

All of these articles are accessible from a URL that looks like this:

http://web.site/CATEGORY/this-is-the-title-slug

So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.

My initial approach was doing this:

1. Get all articles.
2. Get all words, remove all punctuation, split by space, and count them by occurrence.
3. Analyze them, and filter out common non-descriptive words like "them", "I", "this", "these", "their" etc.
4. Once all the common words are filtered out, the only words left are the ones that are tag-worthy.

But this turned out to be a rather manual task, and not a very pretty or helpful approach.

This also suffered from the problem of words or names that are split by a space. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first name and last name together.


asked May 18 '11 02:05

Kasper Grubbe

2 Answers

Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.

To get started, I would suggest implementing a proper tokeniser (much better than splitting on whitespace), and then taking a look at chunking and stemming algorithms.
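
As a rough illustration of why a tokeniser is better than splitting on whitespace, here is a minimal regex-based sketch that uses only the Python standard library. The pattern, the function names and the tiny stop-word set are assumptions made for this example; the tokenisers and stemmers in a toolkit like NLTK handle far more cases.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"them", "i", "this", "these", "their", "the", "a", "and", "of"}

# A token is a run of letters/digits, optionally joined by internal
# apostrophes or hyphens ("don't", "well-known").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return lower-cased word tokens, ignoring surrounding punctuation."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def content_words(text):
    """Tokenize the text and drop anything in STOP_WORDS."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


if __name__ == "__main__":
    print(tokenize("Hello, world! Don't stop."))
    print(content_words("These are their articles"))
```

Unlike a plain `split(" ")`, this keeps "don't" as one token and does not leave commas or full stops attached to words.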

You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
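
To make the n-gram idea concrete, here is a small standard-library sketch that counts bigrams over already-tokenized documents. The function names and the sample data are made up for illustration; NLTK ships its own n-gram helpers that you would normally use instead.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(documents, n=2):
    """Count n-grams across an iterable of token lists."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Hypothetical tokenized articles.
    docs = [
        ["john", "doe", "met", "john", "hanson"],
        ["john", "hanson", "spoke"],
    ]
    for gram, freq in count_ngrams(docs, 2).most_common(3):
        print(" ".join(gram), freq)
```

With bigrams, "john doe" and "john hanson" are counted as separate candidates instead of collapsing into "john".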

Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then check how the algorithm tags the remaining articles to see how well it works.
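
One possible way to set up that check is sketched below. It assumes you have hand-assigned reference tags for at least some articles, which is not something the question mentions; `split_holdout` and `tag_precision` are hypothetical helper names.

```python
import random


def split_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (development, held_out)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def tag_precision(predicted, reference):
    """Fraction of predicted tags that also appear in the reference tags."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(reference)) / len(predicted)


if __name__ == "__main__":
    article_ids = list(range(10))  # stand-ins for real article IDs
    development, held_out = split_holdout(article_ids)
    print(len(development), len(held_out))
    print(tag_precision(["election", "vote"], ["election", "budget"]))
```

You would tune the tagger on the development part only, then average `tag_precision` over the held-out part to get an estimate that is not biased by the tuning.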

answered Sep 27 '22 23:09

Anupam Jain

You should use a metric such as tf-idf to get the tags out:

1. Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term t occurs in the document D, the more important it is for D.
2. Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df(t), the less the term discriminates among your documents and the less interesting it is.
3. Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
4. For each document, declare the top k terms by their tf-idf score to be the tags for that document.
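
The four steps above can be sketched in plain Python as follows. This is only an illustration of the variant given in this answer, tf(t, D) / log(df(t) + 1); the function name `top_tags`, the alphabetical tie-breaking and the sample documents are assumptions.

```python
import math
from collections import Counter


def top_tags(documents, k=3):
    """Return the k highest-scoring terms for each tokenized document."""
    # Step 1: term frequency per document.
    term_counts = [Counter(tokens) for tokens in documents]

    # Step 2: document frequency, i.e. in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    tags = []
    for counts in term_counts:
        # Step 3: tf(t, D) / log(df(t) + 1); df >= 1, so the log is positive.
        scores = {
            term: tf / math.log(doc_freq[term] + 1)
            for term, tf in counts.items()
        }
        # Step 4: highest score first, ties broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        tags.append(ranked[:k])
    return tags


if __name__ == "__main__":
    # Hypothetical tokenized articles.
    docs = [
        ["election", "election", "vote", "the"],
        ["football", "goal", "goal", "the"],
        ["election", "budget", "budget", "the"],
    ]
    print(top_tags(docs, k=2))
```

A more common formulation multiplies tf by log(N / df(t)), where N is the number of documents; swapping it in only changes the scoring line in step 3.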

Various implementations of tf-idf are available: for Java and .NET there's Lucene, and for Python there's scikit-learn (formerly scikits.learn).

If you want to do better than this, use language models. That requires some knowledge of probability theory.

answered Sep 27 '22 23:09

Fred Foo
