
Using scikit-learn to classify text with a large corpus

I have about 1600 articles in my database, with each article already pre-labeled with one of the following categories:

Technology
Science
Business
World
Health
Entertainment
Sports

I am trying to use scikit-learn to build a classifier that would categorize new articles. (I guess I'll split my labeled data in half, for training and testing?)

I am looking to use tf-idf, as I don't have a list of stop-words. (I could use NLTK to extract only adjectives and nouns, but I'd rather give scikit-learn the full article.)

I've read all of the documentation on scikit-learn, but their examples involve word-occurrence counts and N-grams (which are fine), and they never specify how to tie a piece of data to a label.

I've tried looking at their sample code, but it's too confusing to follow.

Could someone help me with this, or point me in the right direction?

Thanks.

asked Oct 12 '13 by TheProofIsTrivium

1 Answer

I think you faced the same problem I did when I started to feed my own data to the classifiers.

You can use the function sklearn.datasets.load_files, but to do so, you need to create this structure:

train
├── science
│   ├── 0001.txt
│   └── 0002.txt
└── technology
    ├── 0001.txt
    └── 0002.txt

where each subdirectory of train is named after a label, and each file inside it is an article carrying that label. Then use load_files to load the data:

In [1]: from sklearn.datasets import load_files

In [2]: load_files('train')
Out[2]: 
{'DESCR': None,
 'data': ['iphone apple smartphone\n',
  'linux windows ubuntu\n',
  'biology astrophysics\n',
  'math\n'],
 'filenames': array(['train/technology/0001.txt', 'train/technology/0002.txt',
       'train/science/0002.txt', 'train/science/0001.txt'], 
      dtype='|S25'),
 'target': array([1, 1, 0, 0]),
 'target_names': ['science', 'technology']}

The object returned is a sklearn.datasets.base.Bunch, which is a simple data wrapper. This is a straightforward way to start playing with the classifiers, but when your data grows larger and changes frequently, you might want to stop using files and instead keep the labeled documents in, say, a database, perhaps with more structure than plain text.

In that case you basically need to generate your list of categories (the target_names), like ['science', 'technology', ...], and assign each document in the data list a target value equal to the index of its category in target_names. The length of data and target must be the same; see the sketch below.
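For example, if the documents come from a database, you can build data and target yourself. This is a minimal sketch; the hardcoded rows list is a hypothetical stand-in for whatever your query returns:

# (text, category) pairs standing in for rows pulled from a database.
rows = [
    ('iphone apple smartphone', 'technology'),
    ('biology astrophysics', 'science'),
    ('stock markets rally', 'business'),
]

# The sorted, de-duplicated category names become target_names.
target_names = sorted({category for _, category in rows})

# data holds the documents; target holds the index of each
# document's category in target_names.
data = [text for text, _ in rows]
target = [target_names.index(category) for _, category in rows]

assert len(data) == len(target)  # one label index per document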

You can take a look at this script I wrote a while ago to run a classifier: https://github.com/darkrho/yatiri/blob/master/scripts/run_classifier.py#L267
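For a compact end-to-end picture covering the tf-idf weighting and the train/test split mentioned in the question, here is a minimal sketch using the current scikit-learn API; LinearSVC is just one reasonable choice of classifier, not the only option:

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Load the labeled articles from the train/ directory shown above.
bunch = load_files('train', encoding='utf-8')

# Hold out half of the corpus for testing, as suggested in the question.
docs_train, docs_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target, test_size=0.5, random_state=42)

# tf-idf weighting; no hand-made stop-word list is required.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

clf = LinearSVC()
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test),
                            target_names=bunch.target_names))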

answered Oct 05 '22 by R. Max