I have about 1600 articles in my database, with each article already pre-labeled with one of the following categories:
Technology
Science
Business
World
Health
Entertainment
Sports
I am trying to use scikit-learn to build a classifier that will categorize new articles. (I guess I'll split my data in half, for training and testing?)
I am looking to use tf-idf, as I don't have a list of stop words (I could use NLTK to extract only adjectives and nouns, but I'd rather give scikit-learn the full article).
I've read all of the documentation on scikit-learn, but their examples involve word occurrence and N-grams (which are fine), but they never specify how to tie a piece of data to a label.
I've tried looking at their sample code, but it's too confusing to follow.
Could someone help me with this, or point me in the right direction?
Thanks.
I think you faced the same problem I did when I started to feed my own data to the classifiers.
You can use the function sklearn.datasets.load_files, but to do so, you need to create this structure:
train
├── science
│ ├── 0001.txt
│ └── 0002.txt
└── technology
├── 0001.txt
└── 0002.txt
where the subdirectories of train are named after the labels, and each file within a label's directory is an article with that corresponding label. Then use load_files to load the data:
In [1]: from sklearn.datasets import load_files
In [2]: load_files('train')
Out[2]:
{'DESCR': None,
'data': ['iphone apple smartphone\n',
'linux windows ubuntu\n',
'biology astrophysics\n',
'math\n'],
'filenames': array(['train/technology/0001.txt', 'train/technology/0002.txt',
'train/science/0002.txt', 'train/science/0001.txt'],
dtype='|S25'),
'target': array([1, 1, 0, 0]),
'target_names': ['science', 'technology']}
The object returned is a sklearn.datasets.base.Bunch
, which is a simple data wrapper. This is a straightforward way to start playing with the classifiers, but when your data grows larger and changes frequently, you may want to stop using files and instead use, for example, a database to store the labeled documents, perhaps with more structure than just plain text. Basically, you will need to generate your list of categories (or target_names
) like ['science', 'technology', ...]
and assign the target
value for each document in the data
list as the index of the labeled category in the target_names
list. The length of data
and target
must be the same.
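For example, here is a sketch of how that manual construction might look; the articles and labels below are made-up placeholders standing in for rows you would fetch from your database:

```python
# Sketch: building data / target / target_names by hand,
# e.g. from labeled rows fetched out of a database.
target_names = ['science', 'technology']

# Each item pairs an article's text with its pre-assigned label.
labeled_articles = [
    ('iphone apple smartphone', 'technology'),
    ('linux windows ubuntu', 'technology'),
    ('biology astrophysics', 'science'),
]

data = [text for text, label in labeled_articles]
# target holds the index of each article's label in target_names.
target = [target_names.index(label) for text, label in labeled_articles]

assert len(data) == len(target)
```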
You can take a look at this script that I wrote a while ago to run a classifier: https://github.com/darkrho/yatiri/blob/master/scripts/run_classifier.py#L267
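To tie this back to the tf-idf part of the question: once you have data and target, the labels are passed to fit alongside the vectorized text. Here is a minimal sketch, using a toy in-memory corpus in place of the Bunch returned by load_files, and LinearSVC as one reasonable choice of classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for bunch.data / bunch.target from load_files.
data = ['iphone apple smartphone', 'linux windows ubuntu',
        'biology astrophysics', 'math physics']
target = [1, 1, 0, 0]  # indexes into target_names
target_names = ['science', 'technology']

# TfidfVectorizer turns each article into a weighted term vector;
# LinearSVC then learns to separate the categories.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(data, target)

# Predict the category of a new, unseen article.
print(target_names[clf.predict(['new apple iphone released'])[0]])
```

The pipeline keeps the vectorizer and classifier together, so clf.predict accepts raw text directly and you never have to manage the tf-idf matrix yourself.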