
Load Custom Dataset (similar to the 20 Newsgroups set) in Scikit for Classification of Text Documents

I'm trying to run this scikit-learn example code on my custom dataset of TED Talks. Each directory is a topic, and under it are text files containing the description of each TED Talk.

This is my dataset's tree structure. As you can see, each directory is a topic, and below it are the text files carrying the descriptions:

Topics/
|-- Activism
|   |-- 1149.txt
|   |-- 1444.txt
|   |-- 157.txt
|   |-- 1616.txt
|   |-- 1706.txt
|   |-- 1718.txt
|-- Adventure
|   |-- 1036.txt
|   |-- 1777.txt
|   |-- 2930.txt
|   |-- 2968.txt
|   |-- 3027.txt
|   |-- 3290.txt
|-- Advertising
|   |-- 3673.txt
|   |-- 3685.txt
|   |-- 6567.txt
|   `-- 6925.txt
|-- Africa
|   |-- 1045.txt
|   |-- 1072.txt
|   |-- 1103.txt
|   |-- 1112.txt
|-- Aging
|   |-- 1848.txt
|   |-- 2495.txt
|   |-- 2782.txt
|-- Agriculture
|   |-- 3469.txt
|   |-- 4140.txt
|   |-- 4733.txt
|   |-- 4939.txt

I have structured my dataset to resemble the 20 Newsgroups set, whose tree structure looks like this:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

In the original code (lines 98-124), this is how the training and testing data are loaded directly from scikit-learn:

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Since this dataset ships with scikit-learn, its labels etc. are all built in. In my case, I know how to load the dataset (Line 84):

dataset = load_files('./TED_dataset/Topics/')

I have no idea what I should do after that. I want to know how to split this data into training and testing sets, and how to generate these fields from my dataset:

data_train.data, data_test.data

All in all, I just want to load my dataset and run it through this code error-free. I have uploaded the dataset here for anyone who might want to see it.

I have referred to this question, which briefly covers train-test loading. I also want to know how data_train.target_names should be fetched from my dataset.

Edit:

I tried to get the train and test split, which returns an error:

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

Updated code is here.

FlyingAura asked Nov 09 '15




1 Answer

I think you are looking for something like this:

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.

Note that bunch.target is an array of integers, which are the indices of the category names stored in bunch.target_names:

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
 'Human growth has strained the Earth\'s resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\'s many overlapping ecosystems.If Earth is a self-regulating system, it\'s clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
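The session above stops at the split. As a self-contained sketch of the remaining steps, the snippet below builds a tiny directory tree shaped like your Topics/<category>/<id>.txt layout (the topic names and documents are invented for illustration), loads it with load_files, splits it, and trains a simple classifier; swap in your real dataset path and any model you prefer:

```python
import os
import tempfile
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fabricated stand-in for TED_dataset/Topics/ -- two topics, two docs each.
docs = {
    'Activism': ['protest rights march community change',
                 'campaign justice movement voices rally'],
    'Adventure': ['mountain climb expedition summit trek',
                  'ocean sail voyage explore wilderness'],
}

# Write the docs to disk as <root>/<topic>/<n>.txt, mirroring your layout.
root = tempfile.mkdtemp()
for topic, texts in docs.items():
    os.makedirs(os.path.join(root, topic))
    for i, text in enumerate(texts):
        with open(os.path.join(root, topic, '%d.txt' % i), 'w') as f:
            f.write(text)

# One sample per .txt file; folder names become the target categories.
bunch = load_files(root, encoding='utf-8')

X_train, X_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target, test_size=0.5,
    stratify=bunch.target, random_state=42)

# Vectorize the raw text, then fit and predict.
vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
predicted = clf.predict(vectorizer.transform(X_test))

print([bunch.target_names[i] for i in predicted])
```

With your real data you would replace the fabricated tree with load_files('./TED_dataset/Topics/') and evaluate the predictions against y_test.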
R. Max answered Nov 14 '22