I want to classify a large number (100K to 1M+) of smallish internet-based articles (tweets, blog articles, news, etc.) by topic. Toward this goal, I have been looking for labeled training documents that I could use to build classifier model(s). To make this post as useful as possible, here are some of the possible sources I've found:
a) www.freebase.com/internet/website/category?instances=
b) wikipedia-miner.cms.waikato.ac.nz (a toolkit for accessing Wikipedia data)
c) en.wikipedia.org/wiki/Wikipedia:Database_download
d) wiki.dbpedia.org/About (SKOS formatted subject keywords belonging to categories)
e) internet search for a large article set, followed by clustering and manual curation
Question 1: Are there additional internet resources that could provide labeled training documents? Keyword sets on a given topic (especially weighted sets) would also be useful.
Ideally, I would like to build a classifier that returns hierarchical categories, where sub-topic detail could be added at a later date as more interest/data becomes available.
Question 2: Are there topic modeling/classification frameworks which are hierarchically structured (and perhaps also extendable)? A code example would be particularly welcome.
Many thanks.
The Reuters Corpus Volume 1 (search for RCV1-v2) is about 800K Reuters articles from the late 1990s, classified into topic, industry, and region categories by humans.
An academic consortium (LDC) distributes various corpora, including one compiled by the NY Times with ~1.5M labeled documents: http://catalog.ldc.upenn.edu/LDC2008T19
Lack of labeled data is an issue that plagues many applications of machine learning. To clarify: are you looking for a human who has looked at your tweets, blog articles, and news, labeled each source, and published that database? Or is it acceptable for a program to have done the classification? In the former case, keywords seem like a good classification scheme, but in practice they are not: different people will choose different keywords for the same content, and that inconsistency will fundamentally harm your machine learning process.
My point is that in either case you should use unsupervised learning (no labels provided) rather than supervised learning (labels provided); you should not be looking for labeled data, because you won't find it. Even if you come across some data that has been labeled by a program, that program will probably have used unsupervised learning methods itself.
I recommend you use some of the functions defined in the cluster module of scikit-learn. These implement unsupervised learning techniques.
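For instance, here is a minimal sketch using AgglomerativeClustering, one of the estimators in sklearn.cluster, on synthetic data; the make_blobs data and the choice of 3 clusters are placeholders of my own, not a prescription. Since agglomerative clustering builds a hierarchy of clusters bottom-up, it also speaks to your hierarchical-categories question:
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# synthetic stand-in for document feature vectors (illustration only)
X, _ = make_blobs(n_samples=200, centers=3, n_features=10, random_state=0)

# merge points bottom-up into a hierarchy of clusters, cut here at 3 clusters
est = AgglomerativeClustering(n_clusters=3)
labels = est.fit_predict(X)   # note: no Y labels are supplied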
UC Irvine has a large repository of machine learning datasets. You can test some of your natural language processing work on their datasets; one popular example is the Enron email dataset. It and 4 others are compiled here.
UCI datasets are great, but they are not in scikit-learn format; you will have to convert them. I usually use the iris dataset, since it is small and you can play around with scikit-learn easily that way. As you can see in this example, the line
est.fit(X)
requires only the data array X and no labels Y, while
X = iris.data
assigns to X a 150-instance by 4-feature numpy array. You need the data from UCI in this same form.
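For reference, a minimal sketch of that iris workflow (KMeans with 3 clusters is just an illustrative choice; any estimator from sklearn.cluster would do):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# the bundled iris data: a 150 x 4 numpy array; no labels are used
X = load_iris().data

# fit an unsupervised estimator on X alone (3 clusters is arbitrary here)
est = KMeans(n_clusters=3)
est.fit(X)
print(est.labels_)   # cluster assignment for each of the 150 instances
Now let's look at the NYTimes news articles.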
From the readme.txt at the UCI link, note:
For each text collection, D is the number of documents, W is the
number of words in the vocabulary, and N is the total number of words
in the collection (below, NNZ is the number of nonzero counts in the
bag-of-words). After tokenization and removal of stopwords, the
vocabulary of unique words was truncated by only keeping words that
occurred more than ten times.
...
NYTimes news articles:
orig source: ldc.upenn.edu
D=300000
W=102660
N=100,000,000 (approx)
That is, your X will have shape 300000 instances by 102660 features. Note the attribute format:
Attribute Information:
The format of the docword.*.txt file is 3 header lines, followed by
NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
This data is in the docword.nytimes.txt data file. Some code to read it and run the clustering algorithm:
import numpy as np
from sklearn.cluster import KMeans

with open('docword.nytimes.txt', 'r') as f:
    # read the three header lines: D, W, NNZ
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())
    # create the scikit-learn X numpy array
    X = np.zeros((n_instances, n_attributes))
    for line in f:
        doc_id, word_id, count = (int(v) for v in line.split())
        # docID/wordID values are 1-based, numpy indices are 0-based
        X[doc_id - 1, word_id - 1] = count

# run sklearn clustering on the nytimes data
n_clusters = 8
est = KMeans(n_clusters=n_clusters)
est.fit(X)
Unfortunately this requires a lot of memory. More memory than my machine has, actually, so I cannot test this code. Nevertheless, I imagine your application domain is comparable to this one. You will have to look into some dimensionality reduction techniques, or only look at smaller subsets of the words at a time.
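As a hedged sketch of that direction (which I also have not run on the full file, and which still assumes the NNZ triples fit in memory): store the counts in a scipy sparse matrix rather than a dense array, reduce the word dimensions with TruncatedSVD, and cluster with MiniBatchKMeans. The 100 components and 8 clusters below are arbitrary placeholders.
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

with open('docword.nytimes.txt', 'r') as f:
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())          # advances past the third header line
    # read the remaining "docID wordID count" triples
    triples = np.loadtxt(f, dtype=np.int64)

rows = triples[:, 0] - 1               # UCI IDs are 1-based
cols = triples[:, 1] - 1
counts = triples[:, 2].astype(np.float64)

# sparse matrix: only the nonzero counts are stored
X = coo_matrix((counts, (rows, cols)),
               shape=(n_instances, n_attributes)).tocsr()

# project the ~100K word dimensions down to 100 components (arbitrary choice)
X_reduced = TruncatedSVD(n_components=100).fit_transform(X)

# MiniBatchKMeans clusters in small batches, which eases memory pressure further
est = MiniBatchKMeans(n_clusters=8)
est.fit(X_reduced)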
I hope this helps. Feel free to message me.