ValueError encountered in SMOTE imblearn.over_sampling

Question

I have been trying to oversample my dataset because it is imbalanced. I am doing binary text classification and would like a 1:1 ratio between the two classes. I am trying SMOTE to achieve this.

I followed this tutorial: https://beckernick.github.io/oversampling-modeling/

However, I encounter an error which says:

ValueError: could not convert string to float

Here is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE

data = pd.read_csv("dataset.csv")

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(train_text, train_y)

    nb_pipeline.fit(train_text, train_y)
    predictions = nb_pipeline.predict(test_text)

    nb_conf_mat += confusion_matrix(test_y, predictions)
    score1 = f1_score(test_y, predictions)
    nb_f1_scores.append(score1)

print("F1 Score: ", sum(nb_f1_scores)/len(nb_f1_scores))
print("Confusion Matrix: ")
print(nb_conf_mat)

Can anyone tell me where I am going wrong? Without the two SMOTE lines, my program works fine.

σηγ · Accepted Answer

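You should oversample after vectorizing the text data but before fitting the classifier. SMOTE needs numeric features, and in your code it receives the raw sentences, which are still strings; that is where the ValueError comes from. Fixing this means splitting up the pipeline in the code. The relevant part of the code should be something like this: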

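# The pipeline now only turns text into TF-IDF features; the classifier is fitted separately.
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    # Vectorize first so that SMOTE receives numeric features, not raw strings.
    vectorized_text = nb_pipeline.fit_transform(train_text)

    # Oversample the training fold only. Newer imbalanced-learn releases spell
    # this SMOTE(sampling_strategy = 1.0) and sm.fit_resample(...).
    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(vectorized_text, train_y)

    # Fit on the resampled data; transform (not fit) the untouched test fold.
    clf = MultinomialNB()
    clf.fit(train_text_res, train_y_res)
    predictions = clf.predict(nb_pipeline.transform(test_text))

    # Accumulate the metrics exactly as in the question's loop.
    nb_conf_mat += confusion_matrix(test_y, predictions)
    nb_f1_scores.append(f1_score(test_y, predictions))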
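
To see why the order matters, here is a minimal pure-Python sketch (not imblearn's actual implementation). SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest neighbours in feature space, so every feature has to be a number. The helper `interpolate` and the two term-count vectors are hypothetical, made up for illustration only; the first part just shows that converting raw text to a float produces the same message as in the question.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between two numeric vectors."""
    gap = rng.random()  # random fraction in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


# Raw text cannot be converted to a number; this is what the error reports.
try:
    float("this is a sentence")
except ValueError as exc:
    error_message = str(exc)

# Hypothetical term-count vectors for two minority-class sentences
# (assumed values, standing in for the vectorizer's output).
minority_a = [1.0, 0.0, 2.0]
minority_b = [3.0, 1.0, 0.0]

rng = random.Random(0)
synthetic = interpolate(minority_a, minority_b, rng)

print(error_message)
print(synthetic)
```

The synthetic point always lies between the two original vectors, which only makes sense once the sentences have been vectorized. That is why the resampling step has to come after `nb_pipeline.fit_transform` and before `clf.fit`.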

Tags: python, naivebayes, scikit-learn

Ankur Sinha
