
Large classification document corpus

Can anyone point me to a large corpus that I can use for classification?

By large, I don't mean Reuters or 20 Newsgroups; I'm talking about a corpus on the order of gigabytes, not 20 MB or so.

I was only able to find Reuters and 20 Newsgroups, which are far too small for what I need.

Kobe-Wan Kenobi asked Aug 27 '15 10:08




1 Answer

The most popular datasets for text-classification evaluation are:

  • Reuters Dataset
  • 20 Newsgroup Dataset

However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:

  • Common Crawl: You could build a large corpus by extracting articles that have specific keywords in their meta tags and use those articles for document classification.

  • Enron Email Dataset: You could do a variety of different classification tasks here.

  • Topic Annotated Enron Dataset: Not free, but it is already labelled and meets your large-corpus requirement.
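As a rough illustration of the Common Crawl idea, here is a minimal sketch that assigns a topic label to a page based on its keywords meta tag. It assumes you have already obtained the raw HTML of each page as a string (downloading and reading the crawl archives is not shown), and the `TOPICS` map, the `MetaKeywordParser` class, and the `label_page` function are hypothetical names made up for this example. It uses only Python's standard library.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the values found in <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            for kw in attr.get("content", "").split(","):
                kw = kw.strip().lower()
                if kw:
                    self.keywords.add(kw)


def label_page(html, topic_keywords):
    """Return the first topic whose keywords overlap the page's meta keywords, else None."""
    parser = MetaKeywordParser()
    parser.feed(html)
    for topic, wanted in topic_keywords.items():
        if parser.keywords & wanted:
            return topic
    return None


# Hypothetical topic map: replace with the classes you actually care about.
TOPICS = {
    "sports": {"football", "tennis"},
    "finance": {"stocks", "markets"},
}

page = '<html><head><meta name="keywords" content="Tennis, Wimbledon"></head><body>...</body></html>'
print(label_page(page, TOPICS))  # prints: sports
```

Pages that return `None` can simply be skipped. Labels obtained this way are noisy, since meta keywords are author-supplied, so it is probably worth spot-checking a sample before training on them.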

You can browse other publicly available datasets here

Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.

Update:

I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder

Skillachie answered Sep 25 '22 22:09