I have applied svm on my dataset. my dataset is multi-label means each observation has more than one label.
while KFold cross-validation
it raises an error not in index
.
It shows the index from 601 to 6007 not in index
(I have 1...6008 data samples).
This is my code:
df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
SVC_pipeline = Pipeline([
('tfidf', TfidfVectorizer(stop_words=stop_words)),
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
print('... Processing {} '.format(category))
# train the model using X_dtm & y
SVC_pipeline.fit(X_train['sentences'], y_train[category])
prediction = SVC_pipeline.predict(X_test['sentences'])
print('SVM Linear Test accuracy is {} '.format(accuracy_score(X_test[category], prediction)))
print 'SVM Linear f1 measurement is {} '.format(f1_score(X_test[category], prediction, average='weighted'))
print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])
Actually, I do not know how to apply KFold cross-validation in which I can get the F1 score and accuracy of each label separately. having looked at this and this did not help me how can I successfully to apply on my case.
for being reproducible, this is a small sample of the data frame the last seven features are my labels including ADR, WD,...
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
Update
when I did whatever Vivek Kumar said It raises the error
ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]
in classifier part . do you have any idea how to resolve it?
there are a couple of links for this error in stackoverflow which says I need to reshape training data. I also did that but no success link Thanks :)
train_index
, test_index
are integer indices based on the number of rows. But pandas indexing dont work like that. Newer versions of pandas are more strict in how you slice or select data from them.
You need to use .iloc
to access the data. More information is available here
This is what you need:
for train_index, test_index in kf.split(X,y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
...
...
# TfidfVectorizer dont work with DataFrame,
# because iterating a DataFrame gives the column names, not the actual data
# So specify explicitly the column name, to get the sentences
SVC_pipeline.fit(X_train['sentences'], y_train[category])
prediction = SVC_pipeline.predict(X_test['sentences'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With