Uniformly shuffle 5 gigabytes of numpy data

I'm training a neural network with about five gigabytes of data stored as numpy arrays. The data are split into chunks of 100000 rows, and I've done six cycles of training over all the chunks in a random order. Unfortunately, the network has begun to overfit. I think it still has capacity to fit the data more closely; my suspicion is that internal regularities within each chunk are starting to contradict one another, and I need to shuffle the data more thoroughly so that it can train on different combinations. I want to try this before going to the trouble of getting more training data.

Does anyone know a good way to generate a new permutation of 3.6 million (very long) rows of numpy data? I thought about using one of these techniques, but writing these arrays using numpy.savetxt produces unbelievably huge files, and I can't tell how to manipulate individual rows from a standard npy file in a way that helps to solve this problem.
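
(To illustrate the savetxt problem: a small, self-contained comparison sketch -- the chunk shape here is made up, not taken from the question -- shows why the text files balloon:)

import os
import tempfile
import numpy

a = numpy.random.rand(100000, 10)   # hypothetical chunk: 100000 rows of float64
with tempfile.TemporaryDirectory() as d:
    numpy.save(os.path.join(d, 'a.npy'), a)
    numpy.savetxt(os.path.join(d, 'a.txt'), a)
    for name in ('a.npy', 'a.txt'):
        print(name, os.path.getsize(os.path.join(d, name)))
# a.npy stores raw float64 at 8 bytes per value; savetxt's default '%.18e'
# format spends roughly 25 ASCII bytes per value, hence the ~3x larger file.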

Right now, my best idea is to create a permutation of paired indices (c, r) into the data, where c chooses a chunk and r chooses a row from that chunk. I could store each row in a new preallocated array, and then save it. But I wonder if there's a less horribly I/O-bound solution. Is there some principled way to shuffle random pairs of chunks together until you get a permutation that's statistically independent from the starting permutation?
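
(For concreteness, a rough sketch of that index-pairing idea -- the chunk filenames and the count of 36 chunks are hypothetical:)

import numpy
import random

# Hypothetical filenames: 36 chunks of 100000 rows each (~3.6M rows total).
chunks = [numpy.load('chunk_%02d.npy' % i, mmap_mode='r') for i in range(36)]

# Every (c, r) pair: c picks a chunk, r picks a row within that chunk.
pairs = [(c, r) for c, chunk in enumerate(chunks) for r in range(chunk.shape[0])]
random.shuffle(pairs)

# Materializing rows in this order gives a uniform permutation, but each row
# costs one random read, which is exactly the I/O-bound concern above.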

asked Nov 20 '14 by senderle

2 Answers

Among the things I've tried so far, a PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward though. If you use a shuffled array of integers to directly index a PyTables array, it's very slow. Much faster is the following two-step process:

  1. Select a random subset of the array using a boolean index array. This must be done in a chunkwise fashion. If you pass the index array directly to the PyTables array, it's slow.
    • Preallocate a numpy array and create a list of slices that split the PyTables array into chunks.
    • Read each chunk entirely into memory, and then use the corresponding chunk of the index array to select the correct values for that chunk.
    • Store the selected values in the preallocated array.
  2. Then shuffle the preallocated array.

This process produces a permutation as random as a normal shuffling process would. If that doesn't seem obvious, consider this: uniformly selecting a size-x subset and then uniformly shuffling it can produce (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)! equally likely ordered arrangements -- exactly the number of ways to draw x rows without replacement from n, so every arrangement is equally probable. This method is fast enough to do a shuffle-on-load for every training cycle. It's also able to compress the data down to ~650M -- nearly a 90% deflation.
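
(A quick sanity check of that counting identity, using Python 3.8+'s math.comb and math.perm with arbitrary small values:)

from math import comb, factorial, perm

n, x = 20, 7
# Ways to pick a subset of x, times ways to order it, equals ordered arrangements.
assert comb(n, x) * factorial(x) == perm(n, x)
assert perm(n, x) == factorial(n) // factorial(n - x)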

Here's my current implementation; this is called once for every training chunk in the corpus. (The returned arrays are shuffled elsewhere.)

def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''

    # Ceiling division so a final partial chunk isn't dropped (also correct on Python 3).
    n_chunks = (h5_array.shape[0] + read_chunksize - 1) // read_chunksize
    slices = [slice(i * read_chunksize, (i + 1) * read_chunksize)
              for i in range(n_chunks)]

    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end

    return a

It's somewhat crazy to me that an O(n^2) approach (iterating over the entire PyTables array for every chunk) is faster in this case than an O(n) approach (randomly selecting each row in one pass). But hey, it works. With a bit more indirection, this could be adapted for loading arbitrary non-random permutations, but that adds more complexity than it's worth here.
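
(For context, a call site for the method above might look roughly like this. The file name, node name, and `loader` object are placeholders of mine, not part of the original code:)

import numpy
import tables

# Placeholders: 'training.h5' and its '/data' node stand in for the real corpus.
h5_file = tables.open_file('training.h5', mode='r')
h5_array = h5_file.root.data                     # shape (n_rows, n_cols)

n_rows = h5_array.shape[0]
batch_rows = 100000

# Boolean mask marking a random subset of batch_rows rows.
ix = numpy.zeros(n_rows, dtype=bool)
ix[numpy.random.choice(n_rows, size=batch_rows, replace=False)] = True

batch = loader._h5_fast_bool_ix(h5_array, ix)    # `loader` is whatever object owns the method
numpy.random.shuffle(batch)                      # the "shuffled elsewhere" step

h5_file.close()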

The mmap solution is here for reference, for those people who need a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the above solution manages the same in less than half that time. This should scale linearly too, because mmap allows (relatively) efficient random access.

import numpy
import os
import random

X = []
Y = []

for filename in os.listdir('input'):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))

for filename in os.listdir('output'):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))

indices = [(chunk, row) for chunk, rows in enumerate(X) 
                        for row in range(rows.shape[0])]
random.shuffle(indices)

newchunks = 50
newchunksize = len(indices) // newchunks

for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))
answered Oct 16 '22 by senderle

The following assumes your data is already divided into easily-retrievable records of some sort. (I don't know if there's a standard file format for numpy data.)

  1. Create an index of the data in the form of a dict, mapping each unique record ID (0 through n - 1) to some means of finding the data again. For instance, if it's all in one binary file, you'd store a tuple of the form (file_offset, record_length). No need to hold onto the data itself.

  2. Create a list of n elements, containing the keys of the index dict (again, 0 through n - 1).

  3. Shuffle the list of record IDs. (Provide your own random number generator, if needed.)

  4. Open a new file (or whatever) to contain the shuffled data.

  5. Read record IDs out of the list from beginning to end. For each record ID, look up that record's location in the index. Grab the data at that location and append it to the output file.

Pseudo-code:

# This assumes a binary file of unequal-length
# records.  It also assumes that the file won't
# be changed while we're doing this.

import random

# Create index.
index = {}
rec_offset = 0
for rec_id, record in original_data.iterate_records():
    # This bit depends greatly on how your data
    # is stored...
    rec_length = len(record)
    index[rec_id] = (rec_offset, rec_length)
    rec_offset += rec_length

# Shuffle.
num_records_indexed = rec_id + 1  # rec_id is still in scope after the loop.
records_order = list(range(num_records_indexed))
random.shuffle(records_order)  # shuffles in place and returns None; seed the RNG first if needed

# Create new shuffled-data file.
with open("output_file.bin", "wb") as output:
    for rec_id in records_order:
        rec_offset, rec_length = index[rec_id]
        record = original_data.get_rec_at(rec_offset, rec_length)
        output.write(record)

Indexing, shuffling, and de-indexing are all O(n), so the worst part should be I/O: reading the data and then copying it (a second read, plus a write).
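
(If the records are fixed-width numpy rows, as in the question, the `get_rec_at` helper the pseudo-code assumes could be little more than a seek and a read. A minimal sketch, with a made-up dtype and row width:)

import numpy

ROW_DTYPE = numpy.float64                          # assumption: float64 values
N_COLS = 128                                       # assumption: row width
REC_LENGTH = N_COLS * numpy.dtype(ROW_DTYPE).itemsize

def get_rec_at(f, rec_offset, rec_length=REC_LENGTH):
    '''Read one fixed-width record from an open binary file object.'''
    f.seek(rec_offset)
    return f.read(rec_length)

# numpy.frombuffer(record, dtype=ROW_DTYPE) turns the raw bytes back into a row.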

answered Oct 16 '22 by Kevin J. Chase