Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

Tags:

I have already pre-cleaned the data, and below shows the format of the top 4 rows:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I have called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now Pandas.core.series.Series of shape (1785,) and `X_test` and `y_test` are also Pandas.core.series.Series of shape (595,)

I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

I'm now at the stage where I would normally apply a classifier, etc (if this were a balanced set of data). However, I initialize imblearn's SMOTE() class (to perform over-sampling)...

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

... but this results in:

     [OUT] ValueError: "Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6.

I've attempted to whittle down the number of n_neighbors but to no avail, any tips or advice would be much appreciated. Thanks for reading.

------------------------------------------------------------------------------------------------------------------------------------

EDIT:

Full Traceback

The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows in the format of a list of strings (df['cleaned']) and y_train also contains 1785 rows in the format of strings (df['Year']).

Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively, to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

As for the number of classes: using Counter(), I've calculated that there are 199 classes (Years), each instance of a class is attached to one element of aforementioned df['cleaned'] data which contains a list of strings extracted from a textual corpus.

The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on vocabularly present.


Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!