Can anyone point me to a large corpus that I can use for classification? By large I don't mean Reuters or 20 Newsgroups; I mean a corpus that is gigabytes in size, not 20 MB or so. So far I have only been able to find Reuters and 20 Newsgroups, which are far too small for what I need.

Large classification document corpus

1 Answer

The most popular datasets for text-classification evaluation are:

Reuters Dataset
20 Newsgroups Dataset

However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:

Common Crawl: You could build a large corpus by extracting articles that have specific keywords in the meta tag, then use those keywords as labels for document classification.
Enron Email Dataset: You could do a variety of different classification tasks here.
Topic Annotated Enron Dataset: Not free, but already labelled, and it meets your large-corpus requirement.
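
To make the Common Crawl suggestion a bit more concrete, here is a minimal sketch of the keyword-labelling step. It assumes you have already pulled raw HTML pages out of the crawl archives; the fetching itself is not shown. The names `MetaKeywordParser`, `LABEL_KEYWORDS` and `label_page` are made up for illustration, and the keyword lists are placeholders you would replace with your own topics.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the terms from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            # The content attribute is conventionally a comma-separated list.
            for term in attr.get("content", "").split(","):
                term = term.strip().lower()
                if term:
                    self.keywords.add(term)


# Hypothetical mapping from class label to the keywords that signal it.
LABEL_KEYWORDS = {
    "sports": {"football", "tennis", "olympics"},
    "finance": {"stocks", "banking", "markets"},
}


def label_page(html, label_keywords=LABEL_KEYWORDS):
    """Return the label whose keyword set overlaps the page's meta keywords
    the most, or None when no keyword matches (the page is then skipped)."""
    parser = MetaKeywordParser()
    parser.feed(html)
    best_label, best_hits = None, 0
    for label, wanted in label_keywords.items():
        hits = len(parser.keywords & wanted)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


page = '<html><head><meta name="keywords" content="Tennis, Wimbledon, olympics"></head></html>'
print(label_page(page))
```

Pages that return `None` can simply be dropped. Labels obtained this way are noisy, since site owners choose meta keywords freely, so it is worth spot-checking a sample before training on them.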

You can browse other publicly available datasets here

Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.

Update:

I created the custom corpus builder module mentioned above but forgot to link it: News Corpus Builder

194

answered Sep 25 '22 22:09

Skillachie


Tags:

dataset

classification

text-classification

corpus

Kobe-Wan Kenobi
