Performing grid search on sklearn.naive_bayes.MultinomialNB on a multi-core machine doesn’t use all the available CPU resources

I’m currently trying to build some text classification tools using Python and Scikit-learn.

My text is not in English, so it isn’t subject to the usual stemming or other English-based dimensionality reduction techniques.

As a result, the TF-IDF matrix gets pretty big (150,000 x 150,000). It can be processed on a regular PC, but running a grid search on it would be too much, so I enlisted the help of Amazon Web Services to run the grid search. (My parameter grid is pretty big, too.)

Here is my code:

    # coding: utf-8
    import os, json, codecs, nltk
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
    from sklearn.grid_search import GridSearchCV
    from time import time
    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB

    print("Importing dataset...")
    with open('y_data.json', 'r') as fp:
        y = json.load(fp)
    with open('dataset.json', 'r') as fp:
        dataset = json.load(fp)

    print("Importing stop words...")
    # Two stop word lists: a light one (utf-8) and a heavier one that also
    # includes the cp874-encoded additions.
    stopword = []
    with codecs.open('stopword.txt', 'r', 'utf-8') as fp:
        for w in fp:
            stopword.append(w.strip())
    light_st = set(stopword)
    with codecs.open('st_data.txt', 'r', 'cp874') as fp:
        for w in fp:
            stopword.append(w.strip())
    heavy_st = set(stopword)

    def pre_process_1(text):
        return text.replace("|", " ")

    def tokenize_1(text):
        return text.split()

    pipeline = Pipeline([
        ('vec', CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                                tokenizer=tokenize_1, stop_words=heavy_st,
                                token_pattern=None)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])

    parameters = {
        'vec__max_df': (0.5, 0.625, 0.75, 0.875, 1.0),
        'vec__max_features': (None, 5000, 10000, 20000),
        'vec__min_df': (1, 5, 10, 20, 50),
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'vec__binary': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
    }

    if __name__ == "__main__":
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
        t0 = time()
        grid_search.fit(dataset, y)
        print("done in {0}s".format(time() - t0))
        print("Best score: {0}".format(grid_search.best_score_))
        print("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print("\t{0}: {1}".format(param_name, best_parameters[param_name]))

And here are the details of my software environment:

  • Python 3.4.2

  • scikit-learn 0.15.2 (installed with pip)

  • Ubuntu Server 14.04 LTS, 64-bit (using HVM)

  • Tried on an EC2 r3.8xlarge instance

At first, I ran my model on a much smaller instance (r3.2xlarge; 8 cores) but worked out from a rough calculation that it would take quite a long time (2 days). So I decided to scale up and use the largest instance (I use the r3 family because my script is quite memory intensive); however, it didn't process as fast as I expected.
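For context, here is a quick back-of-the-envelope count of how many individual fits this grid implies (a sketch; the 3-fold cross-validation is just GridSearchCV's default, which I don't override):

    # Rough count of the fits GridSearchCV will run for the grid above
    # (3 folds is the GridSearchCV default; I don't override it).
    from functools import reduce
    from operator import mul

    values_per_parameter = [5, 4, 5, 2, 2, 2, 2, 6]   # sizes of the tuples in `parameters`
    n_candidates = reduce(mul, values_per_parameter)  # 9600 parameter combinations
    n_folds = 3
    print("{0} candidates x {1} folds = {2} fits".format(
        n_candidates, n_folds, n_candidates * n_folds))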

When I tried to monitor the CPU load (watch -n 5 uptime), I found that the average load never exceeds 9, even when I leave it running for some time. (From what I understand, a thirty-two-core machine fully utilizing all of its cores should show a load of around 32.)

I tried changing the n_jobs parameter to various numbers (8, 32, 128) with the same result. (However, I think the script does try to launch as many jobs as it is instructed to, because when I terminate the process I see things like "Process ForkPoolWorker-30:" and their tracebacks whip past the screen.)

Further inspection with ps x -C python3.4 shows that only 8 Python processes are running. I guessed this might be some kind of limitation imposed by Python or the OS (I built my AMI on a t2.micro instance, which doesn't have many cores), so I rebuilt my environment from scratch on a c3.4xlarge instance, including compiling Python, and changed the OS to Amazon Linux (a Fedora derivative, I think) for better compatibility with the hardware.
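A trivial sanity check for whether Python and joblib (the module scikit-learn uses for n_jobs) can even see all of the cores would be something like:

    # Sanity check: how many cores do Python and joblib report?
    import multiprocessing
    import joblib

    print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
    print("joblib.cpu_count():", joblib.cpu_count())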

However, my script still never used more than 8 cores. Finally, I tried the demo text classification code from the scikit-learn website, http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html (which uses SGDClassifier instead of MultinomialNB), and it runs perfectly with all thirty-two cores!
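A minimal way to test whether the classifier alone makes the difference would be to swap MultinomialNB for SGDClassifier in my own pipeline, along these lines (a sketch using SGDClassifier's defaults, not the demo's exact settings):

    # Sketch: same pipeline as above, but with SGDClassifier in place of
    # MultinomialNB (pre_process_1, tokenize_1 and heavy_st as defined earlier).
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    sgd_pipeline = Pipeline([
        ('vec', CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                                tokenizer=tokenize_1, stop_words=heavy_st,
                                token_pattern=None)),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier()),
    ])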

So... maybe it is something to do with how grid search interacts with the Naive Bayes classifier?

I’m considering filing a bug, but would like to know first whether this is expected behavior of Naive Bayes or whether I’m doing something wrong in my code.

Updates

I can't find a way to test directly whether memory bandwidth is the culprit, but I tried timing my parallel code and its CPU usage in various ways to find where the bottleneck actually occurs.

Experiment 1: perform only the vectorization and transformation.

Using my real data as input (150,000 text documents, each containing roughly 130 words), a parameter space of about 400 combinations, and parallelism handled by joblib (the same module scikit-learn uses), I got:
Using 8 threads: done in 841.017783164978 s and use 24.636999999999993 % of CPU.
Using 16 threads: done in 842.9525656700134 s and use 24.700749999999985 % of CPU.
Using all 32 threads: done in 857.024197101593 s and use 24.242250000000013 % of CPU.

The result clearly indicates that the vectorization process does not scale as more processing power is added.
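For reference, a simplified sketch of the kind of harness behind these numbers (the parallelism is joblib, as in scikit-learn; the psutil-based CPU sampling and the trimmed parameter grid here are illustrative, not my exact code):

    # Sketch: time the vectorization/transformation step alone, in parallel.
    from time import time
    import psutil
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline

    def vectorize_once(docs, max_df, min_df, binary):
        vec = Pipeline([
            ('count', CountVectorizer(encoding='cp874', max_df=max_df,
                                      min_df=min_df, binary=binary)),
            ('tfidf', TfidfTransformer()),
        ])
        return vec.fit_transform(docs).shape

    def time_vectorization(docs, n_jobs, grid):
        psutil.cpu_percent(None)              # reset the CPU counter
        t0 = time()
        Parallel(n_jobs=n_jobs)(
            delayed(vectorize_once)(docs, max_df, min_df, binary)
            for (max_df, min_df, binary) in grid)
        print("done in", time() - t0, "s and use", psutil.cpu_percent(None), "% of CPU")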

Experiment 2: This time I run only MultinomialNB on the pre-vectorized data.

Using the same parameter space of about 400 combinations as before, I got:
Using 8 threads: done in 2102.0565922260284 s and use 25.486000000000054 % of CPU.
Using 16 threads: done in 1385.6887295246124 s and use 49.83674999999993 % of CPU.
Using all 32 threads: done in 1319.416403055191 s and use 89.90074999999997 % of CPU.

The transition from 8 to 16 threads shows a huge improvement. However, as the number of threads increases to 32, the total completion time gets only slightly shorter while the CPU usage increases substantially. This is the point I don't quite understand.
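The classifier-only runs follow the same pattern (sketch; X is the pre-vectorized sparse matrix, and the alpha grid stands in for the full parameter set):

    # Sketch: fit MultinomialNB alone, in parallel, on pre-vectorized data X.
    from time import time
    import psutil
    from joblib import Parallel, delayed
    from sklearn.naive_bayes import MultinomialNB

    def fit_nb(X, y, alpha):
        return MultinomialNB(alpha=alpha).fit(X, y)

    def time_nb(X, y, n_jobs, alphas):
        psutil.cpu_percent(None)              # reset the CPU counter
        t0 = time()
        Parallel(n_jobs=n_jobs)(delayed(fit_nb)(X, y, a) for a in alphas)
        print("done in", time() - t0, "s and use", psutil.cpu_percent(None), "% of CPU")

    # e.g. time_nb(X, y, n_jobs=32, alphas=[1, 0.1, 0.01, 0.001, 0.0001, 0.00001])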

Experiment 3: I combine the two steps.

Using 8 threads: done in 3385.3253166675568 s and use 25.68999999999995 % of CPU.
Using 16 threads: done in 2066.499200105667 s and use 49.359249999999996 % of CPU.
Using all 32 threads: done in 2018.8800330162048 s and use 54.55375000000004 % of CPU.

There is some discrepancy between the times from my own parallel code and those from GridSearchCV, but this could be because of the simplifications in my code (I'm not doing cross-validation or iterating over the full parameter grid as GridSearchCV does).

Conclusions

From my tests, I conclude the following (and please correct me if I’m wrong):

  1. The vectorization phase is much more memory intensive and is therefore the most likely to saturate the memory bandwidth. This can be seen from the completion times and CPU utilization: it hits some kind of bottleneck and doesn't scale. However, it is a relatively quick process. (I've ruled out being IO-bound, since all data is kept in RAM and memory usage during this phase is about 30%.)
  2. MultinomialNB is less memory intensive than the vectorizer; most of its computation seems to happen in-core. So it scales better than the vectorizer (8 → 16), but after that it also hits some kind of bottleneck, and MultinomialNB takes more total time than the vectorizer.
  3. When the two steps are combined, the completion time shows the same trend as MultinomialNB alone. In my opinion, memory bandwidth may be the bottleneck during the vectorization phase, but that phase is relatively short compared to MultinomialNB. So when the number of concurrent tasks is low, the two phases can run simultaneously without saturating the bandwidth; but once the number of processes is high enough, there are enough concurrent processes performing vectorization to saturate the bandwidth, forcing the operating system to throttle the number of running processes. (This would explain the 8-9 running Python processes I found earlier.)
  4. I'm not very sure, but I think the reason SGDClassifier can use 100% CPU is that it spends much longer computing in-core than MultinomialNB does. So in each iteration, most of the time is devoted to in-core SGDClassifier computation rather than vectorization, and the long compute time reduces the chance of many workers reaching the (relatively short but memory-intensive) vectorization phase simultaneously.

I think my best bet now is to go for cluster computing. :)

Asked Oct 26 '14 by PawinP

1 Answer

Looks like your jobs are all memory-bound.

Naive Bayes is an extremely simple model, and its training algorithm consists of a single (sparse) matrix multiplication and a few sums. Similarly, tf-idf is very simple to compute: it sums over its inputs, computes a few logs, and stores the result.
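To make that concrete, the heart of multinomial NB training is essentially the following (a stripped-down sketch of the idea, not scikit-learn's actual implementation):

    # Stripped-down sketch: multinomial NB training is one sparse matrix
    # product plus a few sums and logs (alpha is the smoothing parameter).
    import numpy as np

    def nb_fit(X, Y, alpha=1.0):
        # X: sparse (n_samples, n_features) term counts
        # Y: sparse (n_samples, n_classes) one-hot label indicators
        feature_count = np.asarray(Y.T.dot(X).todense()) + alpha   # the one matmul
        class_count = np.asarray(Y.sum(axis=0)).ravel()
        feature_log_prob = (np.log(feature_count)
                            - np.log(feature_count.sum(axis=1, keepdims=True)))
        class_log_prior = np.log(class_count / class_count.sum())
        return class_log_prior, feature_log_prob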

In fact, NB is so simple that the bottleneck of this program is almost certainly in CountVectorizer, which transforms in-memory data structures several times until it gets all its term counts crammed into the right matrix format. You're likely to hit a memory bandwidth bottleneck if you do a lot of that in parallel.
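If you want to reduce how much of that repeated CountVectorizer work the grid search does, one option (just a sketch, using the helpers and data from your script; it only covers the non-vec__ parameters, and it fits the vectorizer on all documents rather than per CV fold) is to count-vectorize once up front and grid-search only the downstream steps:

    # Sketch: count-vectorize once in the parent process, then grid-search only
    # the tf-idf and classifier parameters (the vec__* parameters would still
    # need their own outer loop).
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.grid_search import GridSearchCV

    counts = CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                             tokenizer=tokenize_1, stop_words=heavy_st,
                             token_pattern=None).fit_transform(dataset)

    inner = Pipeline([('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    inner_parameters = {
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
    }
    grid = GridSearchCV(inner, inner_parameters, n_jobs=-1, verbose=2)
    grid.fit(counts, y)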

(This is all educated guesswork, but it's based on my involvement in scikit-learn development. I'm one of the authors of MultinomialNB and one of many people who have hacked on CountVectorizer to speed it up.)

Answered by Fred Foo