I have a quick question about the randomizationWindow parameter of the reader. The documentation says it controls how much of the data is held in memory, but I'm a little unclear on what effect it has on the randomness of the data. If the training data file starts with one distribution of data and ends with a completely different one, will setting a randomization window smaller than the data size cause the data fed to the trainer not to come from a homogeneous distribution? I just wanted to double-check.
To give a bit more detail on randomization/IO:
The corpus/data is always split into chunks. Chunks make IO efficient, because all sequences of a chunk are read in one go (a chunk is usually 32/64 MB).
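To make the chunking concrete, here is a toy sketch in plain Python (illustration only, not CNTK code; the 1000-sequence corpus and 100-sequence chunks are made-up stand-ins for the 32/64 MB chunks above):

```python
# Toy model of the chunked corpus layout (illustration only, not CNTK code).
corpus = list(range(1000))   # stand-in for 1000 sequences stored on disk
chunk_size = 100             # stand-in for a 32/64 MB chunk

# The corpus is split into fixed-size chunks; the reader then fetches all
# sequences of one chunk from disk in a single read.
chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
print(len(chunks))           # -> 10 chunks, each read in one IO operation
```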
When it comes to randomization, there are two steps:

1. When randomizationWindow is set to a window smaller than the entire data size, the data is divided into randomizationWindow-sized chunks and the order of those chunks is randomized.
2. Within each chunk, the samples are randomized, as sketched below.
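If it helps, here is a minimal sketch of setting the window from the CNTK 2.x Python API. It assumes a hypothetical CTF file train.ctf with streams named features and labels and made-up dimensions; as far as I know, randomization_window_in_chunks is the Python-side counterpart of the BrainScript randomizationWindow setting:

```python
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

# Hypothetical CTF file with "features" (dim 784) and "labels" (dim 10) streams.
deserializer = CTFDeserializer(
    "train.ctf",
    StreamDefs(
        features=StreamDef(field="features", shape=784, is_sparse=False),
        labels=StreamDef(field="labels", shape=10, is_sparse=False),
    ),
)

# Randomize over a rolling window of 100 chunks: chunk order is shuffled,
# then samples are shuffled within each chunk held in memory.
source = MinibatchSource(
    deserializer,
    randomize=True,
    randomization_window_in_chunks=100,
)

minibatch = source.next_minibatch(minibatch_size_in_samples=64)
```

The trade-off is memory versus mixing: a larger window keeps more chunks in memory at once but shuffles samples across a wider span of the file.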