Can anyone point me to a large corpus that I can use for classification?
By large I don't mean Reuters or 20 Newsgroups. I'm talking about a corpus that is gigabytes in size, not 20 MB or so.
So far I have only been able to find Reuters and 20 Newsgroups, which are far too small for what I need.
Document classification can be manual (as it is in library science) or automated (within the field of computer science), and is used to easily sort and manage texts, images or videos. Both types of document classification have their advantages and disadvantages.
One reported conclusion is that a KNN classifier performed best for document classification, with an accuracy of 99.85%, a recall of 100%, and an F-score of 0.997.
Document Classification helps you to apply machine learning to automate the management and processing of large amounts of business documents. With customized classification models, you can use the service in a wide range of business scenarios and adapt it to your special requirements.
The most popular datasets for text-classification evaluation are the Reuters dataset and the 20 Newsgroups dataset.
However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:
Common Crawl: You could build a large corpus by extracting articles that have specific keywords in the meta tag and use it for document classification.
Enron Email Dataset: You could do a variety of different classification tasks here.
Topic Annotated Enron Dataset: Not free, but already labelled, and it meets your large-corpus request.
You can browse other publicly available datasets here
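To make the Common Crawl idea a bit more concrete, here is a minimal sketch of the labelling step, assuming you already have the raw HTML of each page as a string. The names `MetaKeywordParser`, `meta_keywords`, `label_document`, and the `TOPIC_KEYWORDS` mapping are all made up for illustration; only the standard library `html.parser` module is used. Reading the Common Crawl archives themselves is not shown.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            for term in attr.get("content", "").split(","):
                term = term.strip().lower()
                if term:
                    self.keywords.add(term)


def meta_keywords(html):
    """Return the set of lower-cased meta keywords found in an HTML string."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical topic -> keyword mapping; replace with the classes you care about.
TOPIC_KEYWORDS = {
    "sports": {"football", "tennis", "olympics"},
    "finance": {"stocks", "banking", "economy"},
}


def label_document(html, topic_keywords=TOPIC_KEYWORDS):
    """Return the topic whose keywords overlap most with the page's meta keywords.

    Returns None when no topic keyword appears, so the page can be skipped.
    """
    found = meta_keywords(html)
    best_topic, best_overlap = None, 0
    for topic, terms in topic_keywords.items():
        overlap = len(found & terms)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic
```

Pages for which `label_document` returns `None` would simply be dropped, and the rest become (text, label) pairs for training. Labels derived this way are noisy, so it is probably worth spot-checking a sample by hand before relying on them.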
Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.
Update:
I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder