In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.
import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer
# Seed for randomization. Set to some definite integer for debugging and set to None for production
seed = None
### Text processing functions ###
def normalize(string): # Remove diacritics and lowercase
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))
wnl = WordNetLemmatizer()
def tokenize(string): # Ignores special characters and punctuation
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]
def ngrammer(tokens): # Gets all grams in each ingredient
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1, 1 + min(max_n, len(tokens))) for idx in range(len(tokens) + 1 - n)]
print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json','rt') as file:
    recipes_train_json = json.load(file)
# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]
# Build vocabulary from training data grams.
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})
# Stuff everything into a dataframe.
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)
# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values
# extracting numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                        ('tfidf', TfidfTransformer(use_idf=True)),
                        ('svd', TruncatedSVD(n_components=1000))
                        ])
lsa_fit_data = feature_ext.fit_transform(fit_data)
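# lsa_fit_data is now a dense array of shape (n_samples, 1000)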
# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6,-2,5),
}
# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)
The console output is:
Importing training data...
Building n-grams from input data...
Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 ....................................................
And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:
Importing training data...
Building n-grams from input data...
Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.5s
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 7.1s
[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 1.7min finished
The single-threaded execution finishes pretty quickly, so I'm sure I'm giving the parallel case more than enough time to do the calculation itself.
Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OSX 10.10.5, python 3.4.3, ipython 3.2.0, numpy v1.9.3, scipy 0.16.0, scikit-learn v0.16.1 (python and packages all from anaconda distro)
Some additional comments:
I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has 4 jobs out at a time, as I've got 4 cores on this machine (2 physical, but 4 "virtual" cores due to Mac hyperthreading). But unless I misunderstand the console output, in this case it has 8 jobs out without any returning. Watching CPU usage in Activity Monitor in real time, 4 jobs launch, work a bit, then finish (or die?), followed by 4 more that launch, work a bit, and then go completely idle but stick around.
At no point do I see significant memory pressure. The main process tops at about 1GB real mem, the child processes at around 600MB. By the time they hang, real memory is negligible.
The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature extraction pipeline. Note, though, that this pipeline acts before the grid search and is not part of the GridSearchCV job(s).
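For concreteness, this reduced pipeline (identical to the one above, minus the SVD step; the names feature_ext_no_svd and fit_features are just for illustration) runs to completion with n_jobs=-1:
feature_ext_no_svd = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                               ('tfidf', TfidfTransformer(use_idf=True))
                               ])
fit_features = feature_ext_no_svd.fit_transform(fit_data)  # sparse tf-idf matrix, no SVD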
This script is for the Kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects. Each object represents a recipe and contains a list of text snippets which are the ingredients. Since each sample is a collection of documents instead of a single document, I ended up having to write some of my own n-gramming and tokenization logic, since I couldn't figure out how to get the built-in scikit-learn transformers to do exactly what I want. I doubt any of that matters, but just an FYI.
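Roughly, each entry of train.json looks like this (field values made up for illustration; the fields are the ones the script reads):
{"id": 12345, "cuisine": "greek", "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"]}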
I usually run scripts within the IPython CLI with %run, but I get the same behavior running it from the OSX bash terminal with python (3.4.3) directly.
n_jobs is the number of processes you wish to run in parallel for this task; if it is -1, it will use all available processors. The hang might be an issue with the multiprocessing used by GridSearchCV when n_jobs > 1. So rather than multiprocessing, you can try multithreading to see if it works fine:
from sklearn.externals.joblib import parallel_backend

clf = GridSearchCV(...)
with parallel_backend('threading'):
    clf.fit(x_train, y_train)
I was having the same issue with my estimator using GridSearchCV with n_jobs > 1, and this works great across n_jobs values.
PS: I am not sure if "threading" has the same advantages as "multiprocessing" for all estimators. Theoretically, "threading" would not be a great choice if your estimator is limited by the GIL, but if the estimator is cython/numpy based it should do better than "multiprocessing".
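If you want to sanity-check that trade-off for your own case, a quick timing comparison along these lines (a minimal sketch; clf, x_train and y_train as in the snippet above) should settle it:
import time
from sklearn.externals.joblib import parallel_backend

for backend in ('threading', 'multiprocessing'):
    start = time.time()
    with parallel_backend(backend):
        clf.fit(x_train, y_train)
    print('%s: %.1fs' % (backend, time.time() - start))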
System tried on:
MAC OS: 10.12.6
Python: 3.6
numpy==1.13.3
pandas==0.21.0
scikit-learn==0.19.1