 

How to apply SMOTE technique (oversampling) before word embedding layer

How do I apply the SMOTE algorithm before the word embedding layer in an LSTM?

I have a text binary classification problem (Good (9500) vs. Bad (500) reviews, 10000 training samples in total, so the training set is imbalanced). I am using an LSTM with pre-trained word embeddings (100-dimensional vector for each word), so each training input is a sequence of 50 word-dictionary ids (zero-padded when the description has fewer than 50 words, truncated when it exceeds 50 words).

Below is my general flow (a minimal Keras sketch follows the list):

  • Input - 1000 (batch) X 50 (sequence length)
  • Word embedding - 200 (unique vocabulary words) X 100 (word representation)
  • After the word embedding layer (new input for the LSTM) - 1000 (batch) X 50 (sequence) X 100 (features)
  • Final state from the LSTM - 1000 (batch) X 100 (units)
  • Final layer - [1000 (batch) X 100 (units)] X [100 (units) X 2 (output classes)] -> 1000 (batch) X 2
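Putting that flow into code, here is a minimal Keras sketch under my assumptions; the pre-trained embedding matrix is stood in for by a random placeholder, and names like `embedding_matrix` are illustrative, not part of the original question:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Shapes follow the flow above; embedding_matrix (200 x 100) stands in
# for the pre-trained word vectors and is a random placeholder here.
vocab_size, embed_dim, seq_len = 200, 100, 50
embedding_matrix = np.random.rand(vocab_size, embed_dim)  # placeholder

model = Sequential([
    # 1000 x 50 ids -> 1000 x 50 x 100 embedded sequences
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=seq_len, trainable=False),
    # final state: 1000 x 100
    LSTM(100),
    # final layer: 100 units -> 2 output classes
    Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the embedding layer (`trainable=False`) matches the use of fixed pre-trained vectors; any oversampling would have to happen on the id sequences before this model ever sees them, which is exactly the difficulty the question raises.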

All I want is to generate more data for the Bad reviews with the help of SMOTE.

asked Nov 19 '18 by user1531248

People also ask

Can SMOTE be applied to text data?

You need to balance the distribution for your classifier, not for a reader of the text data. So apply SMOTE in the traditional way (though I usually use solution 2 below, so I do not guarantee the result!) combined with some dimensionality reduction step.
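As a sketch of that suggestion (TF-IDF features, dimensionality reduction, then SMOTE) with scikit-learn and imbalanced-learn; the toy reviews and the choice of 20 SVD components are illustrative assumptions:

```python
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the reviews; real data would replace these lists.
good = ["fast delivery great product", "love the quality works well",
        "excellent value highly recommend", "very happy with this purchase"]
bad = ["broke after one day", "terrible quality waste of money"]
texts = good * 25 + bad * 10          # 100 Good vs 20 Bad: imbalanced
labels = [0] * 100 + [1] * 20

# Vectorize, reduce dimensionality, then let SMOTE interpolate new
# minority samples in the reduced feature space.
X = TfidfVectorizer().fit_transform(texts)
X_reduced = TruncatedSVD(n_components=20).fit_transform(X)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_reduced, labels)
```

Note that the resampled `X_res` lives in the reduced TF-IDF space, not in the id-sequence space an embedding layer expects, which is why this route does not plug directly into the LSTM pipeline above.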

How does SMOTE deal with imbalanced data?

For an imbalanced dataset, SMOTE is first applied to create new synthetic minority samples and obtain a balanced distribution. Tomek links are then used to remove samples close to the boundary between the two classes, increasing the separation between them.
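imbalanced-learn packages this combination as `SMOTETomek`; a minimal sketch on a synthetic stand-in for the 9500/500 split:

```python
from collections import Counter
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset standing in for the 9500/500 review split.
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05],
                           random_state=42)

# SMOTE oversamples the minority class, then samples forming Tomek links
# near the class boundary are removed to sharpen the separation.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```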

How do you choose a sampling strategy in SMOTE?

The SMOTE algorithm works as follows: you draw a random sample from the minority class, identify the k nearest neighbors of the observations in this sample, then take one of those neighbors and compute the vector between the current data point and the selected neighbor.
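In imbalanced-learn these choices surface as the `sampling_strategy` and `k_neighbors` parameters; a hedged sketch, again on synthetic stand-in data:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, weights=[0.95, 0.05],
                           random_state=0)

# sampling_strategy=0.5: oversample until the minority class reaches
# half the majority count; k_neighbors=5: interpolate against the 5
# nearest minority neighbours.
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```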

How does SMOTE oversampling work?

SMOTE creates synthetic data using a k-nearest-neighbours algorithm. It first picks a random sample from the minority class, then finds that sample's k nearest neighbours. A synthetic data point is then generated between the chosen sample and one randomly selected neighbour.
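That interpolation step can be written out directly; a minimal NumPy sketch (the helper `smote_sample` is illustrative, not a library function):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority sample (illustrative sketch)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Pick a random minority point and find its k nearest neighbours
    # (the first neighbour returned is the point itself, so skip it).
    i = rng.integers(len(X_minority))
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    neighbor = X_minority[rng.choice(idx[0][1:])]
    # Interpolate: the new point lies on the segment between the two.
    return X_minority[i] + rng.random() * (neighbor - X_minority[i])

X_min = np.random.rand(20, 100)   # e.g. 20 minority points, 100 features
synthetic = smote_sample(X_min)
```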


1 Answer

I faced the same issue. I found this post on Stack Exchange, which proposes adjusting the class weights in the loss function instead of oversampling. Apparently that is the standard way to deal with class imbalance in LSTMs / RNNs.

https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
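A minimal sketch of that class-weight approach in Keras, with toy stand-ins for the question's data; the inverse-frequency weights below assume the 9500/500 Good/Bad split:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy stand-ins for the question's data: 50-id sequences, 0/1 labels.
X_train = np.random.randint(0, 200, size=(1000, 50))
y_train = (np.random.rand(1000) < 0.05).astype(int)   # ~5% "Bad"

model = Sequential([
    Embedding(200, 100, input_length=50),
    LSTM(100),
    Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Inverse-frequency weights for a 9500/500 Good/Bad split: each Bad
# example contributes ~19x more to the loss than a Good example.
class_weight = {0: 10000 / (2 * 9500), 1: 10000 / (2 * 500)}
model.fit(X_train, y_train, epochs=2, class_weight=class_weight)
```

This keeps the model and pipeline unchanged: no synthetic id sequences are needed, because the imbalance is handled in the loss rather than in the data.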

answered Nov 10 '22 by clagger