I am using scikit-learn for machine learning. I followed the steps in the official documentation exactly, but I ran into two problems. Here is the main part of the code:
1) trdata is the training data created using sklearn's train_test_split. 2) ptest and ntest are the test data for the positives and negatives respectively.
## Preprocessing
scaler = StandardScaler()
scaler.fit(trdata)
trdata = scaler.transform(trdata)
ptest = scaler.transform(ptest)
ntest = scaler.transform(ntest)
## Building Classifier
# setting gamma and C for grid search optimization, RBF Kernel and SVM classifier
crange = 10.0 ** np.arange(-2, 9)
grange = 10.0 ** np.arange(-5, 4)
pgrid = dict(gamma = grange, C = crange)
cv = StratifiedKFold(y = tg, n_folds = 3)
## Threshold Ranging
clf = GridSearchCV(SVC(),param_grid = pgrid, cv = cv, n_jobs = 8)
## Training Classifier: Semi Supervised Algorithm
clf.fit(trdata, tg)  # n_jobs is set on GridSearchCV above; it is not a fit() option
Problem 1) When I use n_jobs = 8 in GridSearchCV, the code runs up to GridSearchCV but then hangs, or takes an exceptionally long time without producing a result, when executing 'clf.fit', even for a very small dataset. When I remove n_jobs, both steps execute, but clf.fit takes a very long time to converge for large datasets. My data is a 600 x 12 matrix for each of the positives and the negatives. Can you tell me what exactly n_jobs does and how it should be used? Also, is there a faster fitting technique, or a modification to the code, that would speed this up?
Problem 2) Should StandardScaler be fitted on the positive and negative data combined, or separately on each? I suppose it has to be fitted on the combined data, because only then can the same scaler parameters be applied to the test sets.
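To make the reasoning in Problem 2 concrete, here is a minimal sketch, assuming synthetic stand-in data (the arrays pos and neg and the split sizes are made up for illustration, and the imports use the current sklearn.model_selection module). It fits one StandardScaler on the combined training split and reuses the stored mean and scale on the held-out split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 600 positives and 600 negatives, 12 features each.
rng = np.random.RandomState(0)
pos = rng.normal(5.0, 2.0, (600, 12))
neg = rng.normal(-5.0, 2.0, (600, 12))
X = np.vstack([pos, neg])
y = np.array([1] * 600 + [0] * 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit once on the combined (positive + negative) training split only.
scaler = StandardScaler().fit(X_train)

# The same per-feature mean_ and scale_ are then applied to every other split.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_.shape, scaler.scale_.shape)
```

Only the training split ends up with exactly zero mean and unit variance; the test split is shifted and scaled by the training statistics, which is what keeps the two comparable.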
SVC seems to be very sensitive to data that is not normalized. You can try normalizing the data with:
from sklearn import preprocessing
trdata = preprocessing.scale(trdata)
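A related option is to put the scaling inside the grid search itself, so that each cross-validation fold is scaled using its own training part. The following is only a sketch under assumptions: it uses the current sklearn.model_selection API rather than the older one in the question, synthetic data in place of trdata and tg, and a deliberately smaller grid than the question's to keep the number of fits low. The function name run_search is hypothetical. Note that n_jobs is passed to the GridSearchCV constructor, where it sets how many parameter/fold combinations are fitted in parallel.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def run_search(X, y, n_jobs=1):
    """Grid-search an RBF SVC with scaling done inside each CV fold."""
    pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
    # Parameters of a pipeline step are addressed as <step name>__<parameter>.
    param_grid = {
        "svc__C": 10.0 ** np.arange(-1, 3),
        "svc__gamma": 10.0 ** np.arange(-3, 1),
    }
    cv = StratifiedKFold(n_splits=3)
    # n_jobs goes here, not into fit(); 4 x 4 settings x 3 folds = 48 fits.
    search = GridSearchCV(pipe, param_grid=param_grid, cv=cv, n_jobs=n_jobs)
    search.fit(X, y)
    return search


# Hypothetical stand-in for the question's data: 600 x 12 per class.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(1.0, 1.0, (600, 12)), rng.normal(-1.0, 1.0, (600, 12))])
y = np.array([1] * 600 + [0] * 600)

search = run_search(X, y, n_jobs=1)  # raise n_jobs to use more cores
print(search.best_params_, search.best_score_)
```

The number of fits grows with the product of the grid sizes and the fold count, so the question's 11 x 9 grid with 3 folds means 297 SVC fits. Starting with a coarse grid and refining around the best point is one way to cut that down.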