I have about 1600 articles in my database, with each article already pre-labeled with one of the following categories:
Technology
Science
Business
World
Health
Entertainment
Sports
I am trying to use scikit-learn to build a classifier that will categorize new articles. (I guess I'll split my data in half, for training and testing?)
I am looking to use tf-idf, as I don't have a list of stop words (I could use NLTK to extract only adjectives and nouns, but I'd rather give scikit-learn the full article).
I've read all of the documentation on scikit-learn, but their examples involve word occurrence and N-grams (which are fine), but they never specify how to tie a piece of data to a label.
I've tried looking at their sample code, but it's too confusing to follow.
Could someone help me with this, or point me in the right direction?
Thanks.
I think you faced the same problem I did when I started to feed my own data to the classifiers.
You can use the function sklearn.datasets.load_files, but to do so, you need to create this structure:
train
├── science
│ ├── 0001.txt
│ └── 0002.txt
└── technology
├── 0001.txt
└── 0002.txt
where the subdirectories of train are named after the labels, and each file within a label's directory is an article with that corresponding label. Then use load_files to load the data:
In [1]: from sklearn.datasets import load_files
In [2]: load_files('train')
Out[2]:
{'DESCR': None,
'data': ['iphone apple smartphone\n',
'linux windows ubuntu\n',
'biology astrophysics\n',
'math\n'],
'filenames': array(['train/technology/0001.txt', 'train/technology/0002.txt',
'train/science/0002.txt', 'train/science/0001.txt'],
dtype='|S25'),
'target': array([1, 1, 0, 0]),
'target_names': ['science', 'technology']}
The object returned is a sklearn.datasets.base.Bunch
, which is a simple data wrapper. This is a straightforward way to start playing with the classifiers, but when your data grows larger and changes frequently, you may want to stop using files and instead use, for example, a database to store the labeled documents, perhaps with more structure than just plain text. Basically, you will need to generate your list of categories (or target_names
) like ['science', 'technology', ...]
and assign the target
value for each document in the data
list as the index of the labeled category in the target_names
list. The length of data
and target
must be the same.
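For example, here is a sketch of how that manual construction might look; the articles and labels below are made-up placeholders standing in for rows you would fetch from your database:

```python
# Sketch: building data / target / target_names by hand,
# e.g. from labeled rows fetched out of a database.
target_names = ['science', 'technology']

# Each item pairs an article's text with its pre-assigned label.
labeled_articles = [
    ('iphone apple smartphone', 'technology'),
    ('linux windows ubuntu', 'technology'),
    ('biology astrophysics', 'science'),
]

data = [text for text, label in labeled_articles]
# target holds the index of each article's label in target_names.
target = [target_names.index(label) for text, label in labeled_articles]

assert len(data) == len(target)
```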
You can take a look at this script that I wrote a while ago to run a classifier: https://github.com/darkrho/yatiri/blob/master/scripts/run_classifier.py#L267
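To tie this back to the tf-idf part of the question: once you have data and target, the labels are passed to fit alongside the vectorized text. Here is a minimal sketch, using a toy in-memory corpus in place of the Bunch returned by load_files, and LinearSVC as one reasonable choice of classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for bunch.data / bunch.target from load_files.
data = ['iphone apple smartphone', 'linux windows ubuntu',
        'biology astrophysics', 'math physics']
target = [1, 1, 0, 0]  # indexes into target_names
target_names = ['science', 'technology']

# TfidfVectorizer turns each article into a weighted term vector;
# LinearSVC then learns to separate the categories.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(data, target)

# Predict the category of a new, unseen article.
print(target_names[clf.predict(['new apple iphone released'])[0]])
```

The pipeline keeps the vectorizer and classifier together, so clf.predict accepts raw text directly and you never have to manage the tf-idf matrix yourself.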