I'm doing text classification with MultinomialNB, but I'm running into problems because my data is imbalanced. (The sample data below is simplified; my real dataset is much larger.) I'd like to resample my data using over-sampling, ideally as a step in this pipeline.
The pipeline below works fine without the over-sampling step, but my real data is very imbalanced and needs it.
With this current code, I keep getting the error: "TypeError: All intermediate steps should be transformers and implement fit and transform."
How do I build RandomOverSampler into this pipeline?
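import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline  # sklearn's Pipeline is what raises the TypeError
from imblearn.over_sampling import RandomOverSampler
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],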
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data,columns=['Description','Type'])
# Use the text column as features and the fruit type as the label
X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('RUS', RandomOverSampler()),
('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
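You should use the Pipeline implemented in the imblearn package, not the one from sklearn. sklearn's Pipeline requires every intermediate step to implement fit and transform, while samplers such as RandomOverSampler expose fit_resample instead, hence the TypeError. imblearn's Pipeline accepts samplers as intermediate steps and applies them only during fit, so the data passed to predict and score is not resampled. E.g., this code runs fine: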
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data, columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(df['Description'],
                                                    df['Type'], random_state=0)
# Pipeline here comes from imblearn, so a sampler is allowed as an intermediate step
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('ros', RandomOverSampler()),  # 'ros' = random over-sampler
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
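For intuition about what the over-sampling step does to the training data, here is a minimal pure-Python sketch. It is not imblearn's implementation: the function name random_oversample and its seed parameter are made up for illustration. It duplicates randomly chosen rows of each minority class until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Hypothetical helper: duplicate minority-class rows at random
    (with replacement) until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the original rows, then append the random duplicates.
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # empty for the majority class
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_texts, out_labels


# Same sample data as in the question
data = [['round red fruit that is sweet', 'apple'], ['long yellow fruit with a peel', 'banana'],
        ['round green fruit that is soft and sweet', 'pear'], ['red fruit that is common', 'apple'],
        ['tiny fruits that grow in bunches', 'grapes'], ['purple fruits', 'grapes'],
        ['yellow and long', 'banana'], ['round, small, green', 'grapes'],
        ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
        ['small fruits', 'grapes']]
texts = [row[0] for row in data]
labels = [row[1] for row in data]

balanced_texts, balanced_labels = random_oversample(texts, labels)

print('Before:', Counter(labels))
print('After: ', Counter(balanced_labels))  # every class now has 6 rows
```

Resampling like this belongs on the training split only. The test set should keep its original class distribution so that the score reflects the data the model will actually see.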