 

How Keras IMDB dataset data is preprocessed?

Tags:

python

keras

I'm working on a sentiment-analysis problem and have a dataset very similar to the Keras IMDB dataset. When I load Keras's imdb dataset, it returns sequences of word indices.

(X_train, y_train), (X_test, y_test) = imdb.load_data()
X_train[0]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

But I want to understand how this sequence is constructed. In my own dataset I used CountVectorizer with ngram_range=(1, 2) to tokenize words, but I want to try to replicate the Keras approach.

Daniel Chepenko asked Mar 06 '23 17:03


2 Answers

Each word in the imdb dataset is replaced with an integer representing its frequency rank in the dataset (the most frequent word gets the smallest index). When you call the load_data function for the first time, it will download the dataset.

To see how the values are calculated, let's take a snippet from the source code (link provided at the end):

idx = len(x_train)
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

x_train is a numpy array built from the first idx items of the list xs, where idx is the length of the original training split;

xs is the list formed from all the reviews in x_train and x_test: each item (movie review) is extracted from the dataset and turned into a sequence of word indices. Each word index is then shifted by index_from (defaults to 3), and the sequence is prepended with start_char (1 by default). This reserves the lowest values, so 0 stays free for padding, 1 marks the start of a review, and 2 marks an out-of-vocabulary word.

The numpy arrays x_train, y_train, x_test and y_test are formed in this manner and returned by the load_data function.

The source code is available here.

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
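Putting the pieces together, here is a minimal sketch of the shifting described above, using a tiny hypothetical word index in place of the real one from keras.datasets.imdb.get_word_index():

```python
# Defaults used by imdb.load_data
start_char, oov_char, index_from = 1, 2, 3

# Hypothetical frequency-ranked index: rank 1 = most frequent word
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

def encode(words):
    # Prepend start_char and shift each rank by index_from,
    # reserving 0 (padding), 1 (start) and 2 (out-of-vocabulary)
    ids = [start_char]
    for w in words:
        rank = word_index.get(w)
        ids.append(rank + index_from if rank is not None else oov_char)
    return ids

# Invert the shifted index to recover words from a sequence
inv = {rank + index_from: w for w, rank in word_index.items()}

def decode(ids):
    return " ".join(inv.get(i, "<unk>") for i in ids[1:])

print(encode(["the", "movie", "was", "great"]))  # [1, 4, 5, 6, 7]
```

This is why the real sequences above start with 1 and contain no values below 4 except the reserved markers.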

codeslord answered Mar 11 '23 04:03


As explained here:

  1. Reviews have been preprocessed, and each review is encoded as a sequence of word indices (integers). E.g. the sentence "I am coming home" is preprocessed as [1, 3, 11, 15], where 1 is the vocabulary index for the word "I".

  2. Words are indexed by overall frequency in the dataset, i.e. if you are using a CountVectorizer, you need to sort the vocabulary in descending order of frequency. The resulting order of words then corresponds to their vocabulary indices.
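A rough sketch of that frequency ranking with plain collections.Counter (toy sentences; the counts from a fitted CountVectorizer could be substituted the same way):

```python
from collections import Counter

docs = ["the movie was great", "the acting was great"]

# Count every word across the whole corpus
counts = Counter(w for d in docs for w in d.split())

# Most frequent word gets index 1, next most frequent 2, ...
vocab = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Encode a sentence as its word indices
encoded = [vocab[w] for w in "the movie was great".split()]
print(encoded)
```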

vumaasha answered Mar 11 '23 04:03