 

How Keras IMDB dataset data is preprocessed?

Tags:

python

keras

I'm working on a sentiment-analysis problem and have a dataset very similar to the Keras IMDB dataset. When I load Keras's imdb dataset, it returns sequences of word indices.

(X_train, y_train), (X_test, y_test) = imdb.load_data()
X_train[0]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

But I want to understand how this sequence is constructed. In my own dataset I used CountVectorizer with ngram_range=(1, 2) to tokenize words, but I want to try to replicate the Keras approach.

Daniel Chepenko asked Mar 06 '23 17:03


2 Answers

Each word in the imdb dataset is replaced with an integer representing its frequency rank in the dataset (the most frequent word gets the smallest index). When you call the load_data function for the first time, it will download the dataset.

To see how the values are calculated, let's take a snippet from the source code (link provided at the end):

idx = len(x_train)
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

x_train is a numpy array built from the first idx items of the list xs, where idx is the length of the original training split;

xs is the list formed from all the reviews in x_train and x_test: each item (movie review) is extracted from the dataset and turned into a sequence of word indices. Each word index is then shifted by index_from (defaults to 3), and the sequence is prepended with start_char (1 by default). This reserves the lowest values, so 0 stays free for padding, 1 marks the start of a review, and 2 marks an out-of-vocabulary word.

The numpy arrays x_train, y_train, x_test and y_test are formed in this manner and returned by the load_data function.

The source code is available here.

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
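Putting the pieces together, here is a minimal sketch of the shifting described above, using a tiny hypothetical word index in place of the real one from keras.datasets.imdb.get_word_index():

```python
# Defaults used by imdb.load_data
start_char, oov_char, index_from = 1, 2, 3

# Hypothetical frequency-ranked index: rank 1 = most frequent word
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

def encode(words):
    # Prepend start_char and shift each rank by index_from,
    # reserving 0 (padding), 1 (start) and 2 (out-of-vocabulary)
    ids = [start_char]
    for w in words:
        rank = word_index.get(w)
        ids.append(rank + index_from if rank is not None else oov_char)
    return ids

# Invert the shifted index to recover words from a sequence
inv = {rank + index_from: w for w, rank in word_index.items()}

def decode(ids):
    return " ".join(inv.get(i, "<unk>") for i in ids[1:])

print(encode(["the", "movie", "was", "great"]))  # [1, 4, 5, 6, 7]
```

This is why the real sequences above start with 1 and contain no values below 4 except the reserved markers.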

codeslord answered Mar 11 '23 04:03


As explained here:

  1. Reviews have been preprocessed, and each review is encoded as a sequence of word indices (integers). E.g. the sentence "I am coming home" is preprocessed as [1, 3, 11, 15], where 1 is the vocabulary index for the word "I".

  2. Words are indexed by overall frequency in the dataset, i.e. if you are using a CountVectorizer, you need to sort the vocabulary in descending order of frequency. The resulting order of words then corresponds to their vocabulary indices.
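A rough sketch of that frequency ranking with plain collections.Counter (toy sentences; the counts from a fitted CountVectorizer could be substituted the same way):

```python
from collections import Counter

docs = ["the movie was great", "the acting was great"]

# Count every word across the whole corpus
counts = Counter(w for d in docs for w in d.split())

# Most frequent word gets index 1, next most frequent 2, ...
vocab = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Encode a sentence as its word indices
encoded = [vocab[w] for w in "the movie was great".split()]
print(encoded)
```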

vumaasha answered Mar 11 '23 04:03