Parallel jobs don't finish in scikit-learn's GridSearchCV

In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.

import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer

# Seed for randomization. Set to some definite integer for debugging and set to None for production
seed = None


### Text processing functions ###

def normalize(string):  # Remove diacritics and whatevs
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))

wnl = WordNetLemmatizer()
def tokenize(string):  # Ignores special characters and punctuation
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]

def ngrammer(tokens):  # Gets all grams (up to bigrams) in each ingredient
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1, 1 + min(max_n, len(tokens))) for idx in range(len(tokens) + 1 - n)]

print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json','rt') as file:
    recipes_train_json = json.load(file)

# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]

# Build vocabulary from training data grams. 
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})

# Stuff everything into a dataframe. 
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)


# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values

# Extract numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),  # gram counts over the fixed vocabulary
                        ('tfidf', TfidfTransformer(use_idf=True)),         # reweight by inverse document frequency
                        ('svd', TruncatedSVD(n_components=1000))           # LSA: dense 1000-component output
])
lsa_fit_data = feature_ext.fit_transform(fit_data)

# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6,-2,5),
}

# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2    
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)

The console output is:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 .................................................... 

And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.5s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.1s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.7min finished

The single-threaded execution finishes pretty quickly, so I'm sure I'm giving the parallel-job case enough time to do the calculation itself.

Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OSX 10.10.5, python 3.4.3, ipython 3.2.0, numpy v1.9.3, scipy 0.16.0, scikit-learn v0.16.1 (python and packages all from anaconda distro)

Some additional comments:

I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has 4 jobs out at a time, as I've got 4 cores on this machine (2 physical, but 4 "virtual" cores due to Mac hyperthreading). But unless I misunderstand the console output, in this case it has 8 jobs out without any returning. Watching CPU usage in Activity Monitor in real time, 4 jobs launch, work a bit, then finish (or die?), followed by 4 more that launch, work a bit, and then go completely idle but stick around.

At no point do I see significant memory pressure. The main process tops out at about 1 GB of real memory, the child processes at around 600 MB. By the time they hang, real memory usage is negligible.

The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature extraction pipeline. Note, though, that this pipeline acts before the grid search and is not part of the GridSearchCV job(s).
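
For reference, the multi-job variant that does complete simply drops the SVD step and hands the sparse tf-idf matrix straight to the grid search. A minimal sketch of what I mean, reusing the names defined in the script above:

feature_ext_no_svd = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                               ('tfidf', TfidfTransformer(use_idf=True))])
sparse_fit_data = feature_ext_no_svd.fit_transform(fit_data)
gs_clf.fit(sparse_fit_data, fit_target)  # completes with n_jobs=-1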

This script is for the kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects. Each object represents a recipe and contains a list of text snippets which are the ingredients. Since each sample is a collection of documents instead of a single document, I ended up having to write some of my own n-gramming and tokenization logic since I couldn't figure out how to get the built-in transformers of scikit-learn to do exactly what I wanted. I doubt any of that matters, but just an FYI.

I usually run scripts within the IPython CLI with %run, but I get the same behavior running them from the OSX bash terminal with python (3.4.3) directly.

asked Oct 09 '15 by josh314

People also ask

How much time does GridSearchCV take?

The total number of fits is the number of parameter combinations times the number of CV folds. Say that comes to 180 fits and I have 4 processors available: each processor should then fit the model 180/4 = 45 times. Now, if on average my model takes 10 seconds to train, I'm estimating around 45 · 10 / 60 = 7.5 minutes of training time.
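
As a back-of-envelope sketch of that estimate (all numbers are assumptions, not measurements):

n_candidates = 60   # parameter combinations in the grid (assumed)
n_folds = 3         # cross-validation folds (assumed)
n_workers = 4       # processors available
fit_seconds = 10    # average time for one model fit (assumed)

total_fits = n_candidates * n_folds                  # 180 fits
minutes = total_fits / n_workers * fit_seconds / 60  # 7.5 minutes
print(minutes)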

What is GridSearchCV best_score_?

grid.best_score_ is the mean score across all CV folds for the single best combination of the parameters you specify in tuned_params. Other relevant details about the grid-search process are available as further attributes of the fitted grid object.
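
For illustration, assuming a fitted GridSearchCV object named grid (cv_results_ exists from scikit-learn 0.18 onward; older releases such as the asker's 0.16.1 expose grid_scores_ instead):

print(grid.best_score_)      # mean CV score of the best parameter combination
print(grid.best_params_)     # the winning parameter settings
print(grid.best_estimator_)  # estimator refit on the full training set (refit=True)
print(grid.cv_results_['mean_test_score'])  # per-candidate mean scores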

Does GridSearchCV do cross-validation?

Yes. In GridSearchCV, cross-validation is performed along with the grid search. Cross-validation is used while training the model: before training the model with data, we divide the data into two parts, train data and test data.

What is n_jobs in grid search?

n_jobs is the number of processes you wish to run in parallel for this task; if it is -1, it will use all available processors.
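
For example, GridSearchCV(estimator=clf, param_grid=parameters, n_jobs=-1) launches one worker per available core, exactly as in the question's script.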


1 Answer

This might be an issue with the multiprocessing that GridSearchCV uses when n_jobs > 1. Rather than using multiprocessing, you can try multithreading to see if it works fine.

from sklearn.externals.joblib import parallel_backend

clf = GridSearchCV(...)  # configure the search as usual
with parallel_backend('threading'):  # threads instead of worker processes
    clf.fit(x_train, y_train)
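
Note that sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23; on newer versions, use from joblib import parallel_backend instead.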

I was having the same issue with my estimator using GridSearchCV with n_jobs > 1, and using this works great across n_jobs values.

PS: I am not sure if "threading" has the same advantages as "multiprocessing" for all estimators. But theoretically, "threading" would not be a great choice if your estimator is limited by the GIL; if the estimator is Cython/NumPy based, though, it can be better than "multiprocessing".

System tried on:

macOS: 10.12.6
Python: 3.6
numpy==1.13.3
pandas==0.21.0
scikit-learn==0.19.1
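
Another workaround sometimes suggested for fork-related hangs on OS X (untested here, just a sketch) is to switch Python's multiprocessing start method before any parallel work begins:

import multiprocessing

if __name__ == '__main__':
    # Must be called once, before any worker pools are created.
    multiprocessing.set_start_method('forkserver')  # or 'spawn'
    # ...build the pipeline and GridSearchCV with n_jobs=-1, then fit as usual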
answered by Trideep Rath