How to apply the SMOTE algorithm before the word embedding layer in an LSTM
I have a binary text classification problem: reviews labelled Good (9,500) or Bad (500), for a total of 10,000 training samples, so the training set is imbalanced. I am using an LSTM with pre-trained word embeddings (a 100-dimensional vector for each word). Each training input is therefore a sequence of 50 word-dictionary ids, zero-padded when the description has fewer than 50 words and truncated to 50 when it is longer.
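For readers following along, the fixed-length input described above can be sketched as below. This is a minimal illustration, not the asker's code: the helper name `pad_or_truncate` is hypothetical, and padding at the end of the sequence is an assumption (the question does not say whether zeros go before or after the words).

```python
MAX_LEN = 50  # sequence length stated in the question
PAD_ID = 0    # id used for padding, as stated in the question


def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len ids: cut long inputs, zero-pad short ones.

    Assumption: padding is appended at the end of the sequence.
    """
    clipped = list(ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short_review = pad_or_truncate([3, 7, 2])           # hypothetical word ids
long_review = pad_or_truncate(list(range(1, 61)))   # 60 ids, cut to 50
```

Note that these ids are arbitrary dictionary indices, so interpolating between two id sequences (which is what SMOTE does) has no linguistic meaning at this stage.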
Below is my general flow,
All I want is to generate more data for the Bad reviews with the help of SMOTE.
You need to balance the distribution for your classifier, not for a reader of the text data. So apply SMOTE in the traditional way, with some dimensionality reduction step (however, I usually use solution 2 below, so I do not guarantee the result!).
For an imbalanced dataset, SMOTE is first applied to create new synthetic minority samples and obtain a balanced distribution. Tomek links are then used to remove the samples close to the boundary between the two classes, which increases the separation between them.
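The Tomek-link cleaning step can be sketched as follows. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour. This is a rough pure-Python illustration under stated assumptions: the function names are hypothetical, distances are Euclidean, and both members of each link are dropped (some variants drop only the majority-class member).

```python
import math


def nearest_index(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (an assumption of this sketch)."""
    drop = {idx for pair in tomek_links(points, labels) for idx in pair}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy 1-D data: the points at 0.5 and 0.55 sit on the class boundary.
points = [[0.0], [0.1], [1.0], [1.1], [0.5], [0.55]]
labels = [0, 0, 1, 1, 0, 1]
clean_points, clean_labels = remove_tomek_links(points, labels)
```

In this toy example only the boundary pair is removed, leaving two well-separated clusters.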
The SMOTE algorithm works as follows: You draw a random sample from the minority class. For the observations in this sample, you identify the k nearest neighbors. You then take one of those neighbors and identify the vector between the current data point and the selected neighbor. That vector is multiplied by a random number between 0 and 1 and added to the current data point to produce the synthetic sample.
SMOTE works by using a k-nearest-neighbour algorithm to create synthetic data. It first chooses a random sample from the minority class and finds that sample's k nearest neighbours. Synthetic data is then created between the random sample and one randomly selected neighbour.
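The steps described above can be sketched in a few lines of pure Python. This is a simplified illustration rather than a library implementation: the function name `smote` and its parameters are hypothetical, distances are Euclidean, and the inputs are assumed to be real-valued feature vectors (for text, that would mean some vector representation of a review, not the raw word ids).

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority points.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority sample.
        i = rng.randrange(len(minority))
        base = minority[i]
        # 2. Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one neighbour and interpolate at a random position
        #    on the segment between the sample and that neighbour.
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data
new_points = smote(minority, n_new=10, k=2)
```

Every synthetic point lies on a line segment between two existing minority points, which is why the method only makes sense in a continuous feature space.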
I faced the same issue. I found this post on Stack Exchange, which proposes adjusting the class weights instead of oversampling. Apparently this is the standard way to deal with class imbalance in LSTMs / RNNs.
https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
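As a rough illustration of the class-weight idea, one common heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies it to the counts from the question; the helper name `balanced_class_weights` and the label encoding (0 = Good, 1 = Bad) are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical encoding: 0 = Good (9,500 samples), 1 = Bad (500 samples).
labels = [0] * 9500 + [1] * 500
weights = balanced_class_weights(labels)
```

With these counts the Bad class gets a weight of 10 and the Good class a weight of roughly 0.53, so each Bad review contributes about 19 times as much to the loss. If the model is trained with Keras (which the question does not state), a dictionary of this shape can be passed as the `class_weight` argument of `model.fit`.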