Parallel jobs don't finish in scikit-learn's GridSearchCV

In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.

import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer

# Seed for randomization. Set to some definite integer for debugging and set to None for production
seed = None


### Text processing functions ###

def normalize(string):  # Remove diacritics and whatevs
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))

wnl = WordNetLemmatizer()
def tokenize(string):  # Ignores special characters and punctuation
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]

def ngrammer(tokens):  # Gets all grams (up to bigrams) in each ingredient
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1, 1 + min(max_n, len(tokens))) for idx in range(len(tokens) + 1 - n)]

print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json','rt') as file:
    recipes_train_json = json.load(file)

# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]

# Build vocabulary from training data grams. 
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})

# Stuff everything into a dataframe. 
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)


# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values

# Extract numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),  # gram counts over the fixed vocabulary
                        ('tfidf', TfidfTransformer(use_idf=True)),         # reweight by inverse document frequency
                        ('svd', TruncatedSVD(n_components=1000))           # LSA: dense 1000-component output
])
lsa_fit_data = feature_ext.fit_transform(fit_data)

# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6,-2,5),
}

# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2    
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)

The console output is:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 .................................................... 

And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.5s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.1s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.7min finished

The single-threaded execution finishes pretty quickly, so I'm sure I'm giving the parallel-job case enough time to do the calculation itself.

Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OSX 10.10.5, python 3.4.3, ipython 3.2.0, numpy v1.9.3, scipy 0.16.0, scikit-learn v0.16.1 (python and packages all from anaconda distro)

Some additional comments:

I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has 4 jobs out at a time, as I've got 4 cores on this machine (2 physical, but 4 "virtual" cores due to Mac hyperthreading). But unless I misunderstand the console output, in this case it has 8 jobs out without any returning. Watching CPU usage in Activity Monitor in real time, 4 jobs launch, work a bit, then finish (or die?), followed by 4 more that launch, work a bit, and then go completely idle but stick around.

At no point do I see significant memory pressure. The main process tops out at about 1 GB of real memory, the child processes at around 600 MB. By the time they hang, real memory usage is negligible.

The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature extraction pipeline. Note, though, that this pipeline acts before the grid search and is not part of the GridSearchCV job(s).
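
For reference, the multi-job variant that does complete simply drops the SVD step and hands the sparse tf-idf matrix straight to the grid search. A minimal sketch of what I mean, reusing the names defined in the script above:

feature_ext_no_svd = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                               ('tfidf', TfidfTransformer(use_idf=True))])
sparse_fit_data = feature_ext_no_svd.fit_transform(fit_data)
gs_clf.fit(sparse_fit_data, fit_target)  # completes with n_jobs=-1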

This script is for the kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects. Each object represents a recipe and contains a list of text snippets which are the ingredients. Since each sample is a collection of documents instead of a single document, I ended up having to write some of my own n-gramming and tokenization logic since I couldn't figure out how to get the built-in transformers of scikit-learn to do exactly what I wanted. I doubt any of that matters, but just an FYI.

I usually run scripts within the IPython CLI with %run, but I get the same behavior running them from the OSX bash terminal with python (3.4.3) directly.

asked Oct 09 '15 by josh314

People also ask

How much time does GridSearchCV take?

The total number of fits is the number of parameter combinations times the number of CV folds. Say that comes to 180 fits and I have 4 processors available: each processor should then fit the model 180/4 = 45 times. Now, if on average my model takes 10 seconds to train, I'm estimating around 45 · 10 / 60 = 7.5 minutes of training time.
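
As a back-of-envelope sketch of that estimate (all numbers are assumptions, not measurements):

n_candidates = 60   # parameter combinations in the grid (assumed)
n_folds = 3         # cross-validation folds (assumed)
n_workers = 4       # processors available
fit_seconds = 10    # average time for one model fit (assumed)

total_fits = n_candidates * n_folds                  # 180 fits
minutes = total_fits / n_workers * fit_seconds / 60  # 7.5 minutes
print(minutes)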

What is GridSearchCV best_score_?

grid.best_score_ is the mean score across all CV folds for the single best combination of the parameters you specify in tuned_params. Other relevant details about the grid-search process are available as further attributes of the fitted grid object.
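
For illustration, assuming a fitted GridSearchCV object named grid (cv_results_ exists from scikit-learn 0.18 onward; older releases such as the asker's 0.16.1 expose grid_scores_ instead):

print(grid.best_score_)      # mean CV score of the best parameter combination
print(grid.best_params_)     # the winning parameter settings
print(grid.best_estimator_)  # estimator refit on the full training set (refit=True)
print(grid.cv_results_['mean_test_score'])  # per-candidate mean scores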

Does GridSearchCV do cross-validation?

Yes. In GridSearchCV, cross-validation is performed along with the grid search. Cross-validation is used while training the model: before training the model with data, we divide the data into two parts, train data and test data.

What is n_jobs in grid search?

n_jobs is the number of processes you wish to run in parallel for this task; if it is -1, it will use all available processors.
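
For example, GridSearchCV(estimator=clf, param_grid=parameters, n_jobs=-1) launches one worker per available core, exactly as in the question's script.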


1 Answer

This might be an issue with the multiprocessing that GridSearchCV uses when n_jobs > 1. Rather than using multiprocessing, you can try multithreading to see if it works fine.

from sklearn.externals.joblib import parallel_backend

clf = GridSearchCV(...)  # configure the search as usual
with parallel_backend('threading'):  # threads instead of worker processes
    clf.fit(x_train, y_train)
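
Note that sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23; on newer versions, use from joblib import parallel_backend instead.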

I was having the same issue with my estimator using GridSearchCV with n_jobs > 1, and using this works great across n_jobs values.

PS: I am not sure if "threading" has the same advantages as "multiprocessing" for all estimators. But theoretically, "threading" would not be a great choice if your estimator is limited by the GIL; if the estimator is Cython/NumPy based, though, it can be better than "multiprocessing".

System tried on:

macOS: 10.12.6
Python: 3.6
numpy==1.13.3
pandas==0.21.0
scikit-learn==0.19.1
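
Another workaround sometimes suggested for fork-related hangs on OS X (untested here, just a sketch) is to switch Python's multiprocessing start method before any parallel work begins:

import multiprocessing

if __name__ == '__main__':
    # Must be called once, before any worker pools are created.
    multiprocessing.set_start_method('forkserver')  # or 'spawn'
    # ...build the pipeline and GridSearchCV with n_jobs=-1, then fit as usual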
answered by Trideep Rath