
Load Custom Dataset (similar to the 20 Newsgroups set) in Scikit for Classification of Text Documents

I'm trying to run this scikit-learn example code on my custom dataset of TED Talks. Each directory is a topic, and under it are text files containing the description of each TED Talk.

This is my dataset's tree structure. As you can see, each directory is a topic, and below it are the text files carrying the descriptions:

Topics/
|-- Activism
|   |-- 1149.txt
|   |-- 1444.txt
|   |-- 157.txt
|   |-- 1616.txt
|   |-- 1706.txt
|   |-- 1718.txt
|-- Adventure
|   |-- 1036.txt
|   |-- 1777.txt
|   |-- 2930.txt
|   |-- 2968.txt
|   |-- 3027.txt
|   |-- 3290.txt
|-- Advertising
|   |-- 3673.txt
|   |-- 3685.txt
|   |-- 6567.txt
|   `-- 6925.txt
|-- Africa
|   |-- 1045.txt
|   |-- 1072.txt
|   |-- 1103.txt
|   |-- 1112.txt
|-- Aging
|   |-- 1848.txt
|   |-- 2495.txt
|   |-- 2782.txt
|-- Agriculture
|   |-- 3469.txt
|   |-- 4140.txt
|   |-- 4733.txt
|   |-- 4939.txt

I have structured my dataset to resemble the 20 Newsgroups set, whose tree structure looks like this:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

In the original code (lines 98-124), this is how the training and testing data are loaded directly from scikit-learn:

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Since this dataset ships with scikit-learn, its labels etc. are all built in. In my case, I know how to load the dataset (Line 84):

dataset = load_files('./TED_dataset/Topics/')

I have no idea what I should do after that. I want to know how to split this data into training and testing sets, and how to generate these fields from my dataset:

data_train.data, data_test.data

All in all, I just want to load my dataset and run it through this code error-free. I have uploaded the dataset here for anyone who might want to see it.

I have referred to this question, which briefly covers train-test loading. I also want to know how data_train.target_names should be fetched from my dataset.

Edit:

I tried to get the train and test split, which returns an error:

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

Updated code is here.

FlyingAura asked Nov 09 '15




1 Answer

I think you are looking for something like this:

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.

Note that bunch.target is an array of integers, which are the indices of the category names stored in bunch.target_names:

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
 'Human growth has strained the Earth\'s resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\'s many overlapping ecosystems.If Earth is a self-regulating system, it\'s clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
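The session above stops at the split. As a self-contained sketch of the remaining steps, the snippet below builds a tiny directory tree shaped like your Topics/<category>/<id>.txt layout (the topic names and documents are invented for illustration), loads it with load_files, splits it, and trains a simple classifier; swap in your real dataset path and any model you prefer:

```python
import os
import tempfile
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fabricated stand-in for TED_dataset/Topics/ -- two topics, two docs each.
docs = {
    'Activism': ['protest rights march community change',
                 'campaign justice movement voices rally'],
    'Adventure': ['mountain climb expedition summit trek',
                  'ocean sail voyage explore wilderness'],
}

# Write the docs to disk as <root>/<topic>/<n>.txt, mirroring your layout.
root = tempfile.mkdtemp()
for topic, texts in docs.items():
    os.makedirs(os.path.join(root, topic))
    for i, text in enumerate(texts):
        with open(os.path.join(root, topic, '%d.txt' % i), 'w') as f:
            f.write(text)

# One sample per .txt file; folder names become the target categories.
bunch = load_files(root, encoding='utf-8')

X_train, X_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target, test_size=0.5,
    stratify=bunch.target, random_state=42)

# Vectorize the raw text, then fit and predict.
vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
predicted = clf.predict(vectorizer.transform(X_test))

print([bunch.target_names[i] for i in predicted])
```

With your real data you would replace the fabricated tree with load_files('./TED_dataset/Topics/') and evaluate the predictions against y_test.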
R. Max answered Nov 14 '22