I'm currently building some text classification tools using Python and scikit-learn.
My text is not in English, so it isn't amenable to the usual treatment of stemming or other English-specific dimensionality reduction.
As a result, the tf-idf matrix gets pretty big (150,000 x 150,000). It can be processed on a regular PC, but running a grid search on it would be too much, so I enlisted Amazon Web Services to run the grid search. (My parameter set is pretty big, too.)
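(A rough back-of-the-envelope check, my own arithmetic rather than anything measured: stored densely, a 150,000 x 150,000 float64 matrix would take 150,000^2 * 8 bytes ≈ 180 GB, but with only ~130 terms per document the sparse matrix holds about 150,000 * 130 ≈ 2 * 10^7 non-zeros, i.e. a few hundred MB in CSR format, which is why a regular PC copes.)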
Here is my code:
# coding: utf-8
import os, json, codecs, nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from time import time
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

print("Importing dataset...")
with open('y_data.json', 'r') as fp:
    y = json.load(fp)
with open('dataset.json', 'r') as fp:
    dataset = json.load(fp)

print("Importing stop words...")
with codecs.open('stopword.txt', 'r', 'utf-8') as fp:
    stopword = []
    for w in fp:
        stopword.append(w.strip())
light_st = set(stopword)
with codecs.open('st_data.txt', 'r', 'cp874') as fp:
    for w in fp:
        stopword.append(w.strip())
heavy_st = set(stopword)

def pre_process_1(text):
    return text.replace("|", " ")

def tokenize_1(text):
    return text.split()

pipeline = Pipeline([
    ('vec', CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                            tokenizer=tokenize_1, stop_words=heavy_st,
                            token_pattern=None)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

parameters = {
    'vec__max_df': (0.5, 0.625, 0.75, 0.875, 1.0),
    'vec__max_features': (None, 5000, 10000, 20000),
    'vec__min_df': (1, 5, 10, 20, 50),
    'tfidf__use_idf': (True, False),
    'tfidf__sublinear_tf': (True, False),
    'vec__binary': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
    t0 = time()
    grid_search.fit(dataset, y)
    print("done in {0}s".format(time() - t0))
    print("Best score: {0}".format(grid_search.best_score_))
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t{0}: {1}".format(param_name, best_parameters[param_name]))
And here are the details of my software environment:
Python 3.4.2
scikit-learn 0.15.2 (installed with pip)
Ubuntu Server 14.04 LTS, 64-bit (using HVM)
Tried on an EC2 r3.8xlarge instance
At first, I ran my model on a much smaller instance (r3.2xlarge; 8 cores) but calculated that it would take quite a long time (2 days). So I decided to scale up to the largest instance (I use the r3 family because my script is quite memory-intensive); however, it didn't run as fast as I expected.
When I monitored the CPU load (watch -n 5 uptime)... I found that the average CPU load never exceeds 9, even after leaving it running for some time. (From what I understand, a thirty-two-core machine, when fully utilizing all of its cores, should be around 32.)
I tried changing the n_jobs parameter to various numbers (8, 32, 128) with the same result. (However, I think the script does try to spawn as many jobs as instructed, because when I terminate the process I see lines like "Process ForkPoolWorker-30:" and their tracebacks whip past the screen.)
Further inspection with ps x -C python3.4 showed that only 8 Python processes were running. I deduced that this might be some kind of limitation imposed by Python or the OS (I built my AMI on a t2.micro instance, which doesn't have many cores). So I decided to rebuild my environment from scratch, including compiling Python, on a c3.4xlarge instance, and to switch the OS to Amazon Linux (a Fedora derivative, I think) for better compatibility with the hardware.
However, my script still never exceeded 8 cores. Finally, I tried the demo text classification code from the scikit-learn website, http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html (which uses SGDClassifier instead of MultinomialNB), and it runs perfectly on all thirty-two cores!
So... maybe it has something to do with the grid search algorithm and the Naive Bayes classifier?
I'm considering filing a bug, but would like to know first whether this is expected behavior of Naive Bayes or whether I'm doing something wrong in my code.
I can't find a way to test directly whether memory bandwidth is the culprit. But I tried timing my parallel code and CPU usage in various ways to find where the bottleneck actually occurs.
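Here is a minimal sketch of the kind of harness behind the numbers below (the fit_one/timed_run helpers and the use of psutil for system-wide CPU sampling are my own simplifications, not scikit-learn API):

from time import time
import psutil
from sklearn.base import clone
from sklearn.externals.joblib import Parallel, delayed

def fit_one(pipeline, params, X, y):
    # Fit a fresh copy of the pipeline with one parameter combination.
    return clone(pipeline).set_params(**params).fit(X, y)

def timed_run(pipeline, candidates, X, y, n_jobs):
    psutil.cpu_percent(interval=None)  # reset the system-wide CPU counter
    t0 = time()
    Parallel(n_jobs=n_jobs)(
        delayed(fit_one)(pipeline, params, X, y) for params in candidates)
    print("done in {0} s and use {1} % of CPU.".format(
        time() - t0, psutil.cpu_percent(interval=None)))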
The input is my real data (150,000 text documents, each containing roughly 130 words), and the parameter space has about 400 combinations. Parallelism is handled by joblib (the same module scikit-learn uses). I got:
Using 8 threads: done in 841.017783164978 s and use 24.636999999999993 % of CPU.
Using 16 threads: done in 842.9525656700134 s and use 24.700749999999985 % of CPU.
Using all 32 threads: done in 857.024197101593 s and use 24.242250000000013 % of CPU.
The result clearly indicates that the vectorization process is unable to scale as the processing power increases.
Using the same parameter space of about 400 combinations as before, I got:
Using 8 threads: done in 2102.0565922260284 s and use 25.486000000000054 % of CPU.
Using 16 threads: done in 1385.6887295246124 s and use 49.83674999999993 % of CPU.
Using all 32 threads: done in 1319.416403055191 s and use 89.90074999999997 % of CPU.
Going from 8 to 16 threads shows a huge improvement. However, as the number of threads increases to 32, the total completion time gets only slightly shorter while the CPU usage rises substantially. This point I don't quite understand.
A third run gave:
Using 8 threads: done in 3385.3253166675568 s and use 25.68999999999995 % of CPU.
Using 16 threads: done in 2066.499200105667 s and use 49.359249999999996 % of CPU.
Using all 32 threads: done in 2018.8800330162048 s and use 54.55375000000004 % of CPU.
There is some discrepancy between the times from my own parallel code and those from GridSearchCV, but that could be because of the simplifications in my code (I'm not doing cross-validation or the full parameter iteration that GridSearchCV does).
From my tests I conclude (and please correct me if I'm wrong) that the vectorization step cannot scale with more processing power at all, and that the rest of the pipeline scales only up to about 16 threads.
I think my best bet now is to go for cluster computing. :)
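If I do, one straightforward way to split the work (my own sketch; the interleaved chunking is just an illustration) is to partition the candidate grid across machines with ParameterGrid and compare the best scores afterwards:

from sklearn.grid_search import ParameterGrid

# Enumerate all ~400 candidate combinations once...
candidates = list(ParameterGrid(parameters))
# ...and hand each machine an interleaved slice of them.
n_machines = 4
chunks = [candidates[i::n_machines] for i in range(n_machines)]
# Each machine evaluates its chunk (e.g. with the timing harness above)
# and the resulting (params, score) pairs are compared at the end.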
Looks like your jobs are all memory-bound.
Naive Bayes is an extremely simple model, and its training algorithm consists of a single (sparse) matrix multiplication and a few sums. Similarly, tf-idf is very simple to compute: it sums over its inputs, computes a few logs, and stores the result.
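To see concretely why that is cheap, here is a rough sketch of the core computation (a simplification, not the actual MultinomialNB source, which also handles class log-priors and several other details):

import numpy as np
from scipy.sparse import csr_matrix

def nb_fit_sketch(X, Y, alpha=1.0):
    # X: (n_samples, n_features) sparse term-count matrix.
    # Y: (n_samples, n_classes) one-hot label matrix.
    # One sparse matrix product yields the per-class feature counts...
    fc = np.asarray(csr_matrix(Y).T.dot(X).todense()) + alpha
    # ...and the rest is a few sums and logs: log P(feature | class).
    return np.log(fc) - np.log(fc.sum(axis=1, keepdims=True))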
In fact, NB is so simple that the bottleneck of this program is almost certainly in CountVectorizer, which transforms in-memory data structures several times until it gets all its term counts crammed into the right matrix format. You're likely to hit a memory bandwidth bottleneck if you do a lot of that in parallel.
(This is all educated guesswork, but it's based on my involvement in scikit-learn development. I'm one of the authors of MultinomialNB and one of many people who have hacked on CountVectorizer to speed it up.)
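If CountVectorizer really is the culprit, one way to test that hypothesis (my sketch, reusing the names from the question's script and fixing the vectorizer's settings up front) is to vectorize once outside the grid search, so the parallel workers only re-fit the cheap tf-idf and NB stages:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Count the terms once, outside the grid search.
vec = CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                      tokenizer=tokenize_1, stop_words=heavy_st,
                      token_pattern=None)
X_counts = vec.fit_transform(dataset)

# Only the cheap stages are re-fit for every parameter combination.
small_pipeline = Pipeline([('tfidf', TfidfTransformer()),
                           ('clf', MultinomialNB())])
small_parameters = {'tfidf__use_idf': (True, False),
                    'tfidf__sublinear_tf': (True, False),
                    'tfidf__norm': ('l1', 'l2'),
                    'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001)}
grid_search = GridSearchCV(small_pipeline, small_parameters, n_jobs=-1, verbose=2)
grid_search.fit(X_counts, y)

The trade-off is that the vec__ parameters drop out of the search; but if CPU usage now scales with n_jobs, the vectorizer was the memory-bound stage.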