I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same lines, shuffled randomly by line. I can't hold all the data in memory. Is there a good way to do this in Python or on the command line that takes a reasonable amount of time (a couple of days)?
I was thinking I could touch 50 empty files, stream through the 2-billion-line file, and randomly distribute each line to one of the 50 files, then cat the 50 files together. Would there be any major systematic bias to this method?
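Something like this rough, untested sketch is what I have in mind (file names and the bucket count are just placeholders):

```python
import random

NUM_BUCKETS = 50

# Open the bucket files for writing.
buckets = [open("bucket_%02d.txt" % i, "w") for i in range(NUM_BUCKETS)]

with open("input.txt") as f:
    for line in f:
        # Send each line to a uniformly random bucket.
        random.choice(buckets).write(line)

for b in buckets:
    b.close()

# Afterwards: cat bucket_*.txt > shuffled.txt
```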
If you can reserve 16 GB of memory for this program, I wrote a program called sample that shuffles the lines of a file by reading in their byte offsets, shuffling the offsets, and then printing output by seeking through the file to the shuffled offsets. It uses 8 bytes for each 64-bit offset, thus 16 GB for a two-billion-line input.

It won't be fast, but on a system with enough memory, sample will shuffle files that are large enough to cause GNU shuf to fail. Further, it uses mmap routines to try to minimize the I/O expense of a second pass through your file. It also has a few other options; see --help for more details.
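For illustration only, here is a minimal Python sketch of the same offset-based idea (this is not sample itself): record the byte offset of each line, shuffle the offsets, then seek and copy lines. The array module keeps each offset at 8 bytes; file names are placeholders.

```python
import random
from array import array

def shuffle_lines(in_path, out_path):
    # First pass: record the byte offset of each line (8 bytes per offset).
    offsets = array("q")
    with open(in_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    # Shuffle the offsets in memory.
    random.shuffle(offsets)
    # Second pass: seek to each shuffled offset and copy that line out.
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:
            f.seek(off)
            out.write(f.readline())

shuffle_lines("input.txt", "shuffled.txt")
```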
By default, this program will sample without replacement and shuffle by single lines. If you want to shuffle with replacement, or if your input is in FASTA, FASTQ or another multi-line format, you can add some options to adjust how sampling is done. (Or you can apply an alternative approach, which I link to in a Perl gist below, but sample addresses these cases.)
If your FASTA sequences are on every two lines, that is, they alternate between a sequence header on one line and sequence data on the next, you can still shuffle with sample, and with half the memory, since you are only shuffling half the number of offsets. See the --lines-per-offset option; you'd specify 2, for instance, to shuffle pairs of lines.

In the case of FASTQ files, their records are split every four lines. You can specify --lines-per-offset=4 to shuffle a FASTQ file with a fourth of the memory required to shuffle a single-line file.
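The single-line sketch above can be generalized along the same lines as --lines-per-offset by storing one offset per record of N lines; this is a hypothetical illustration of the idea, not sample's actual code:

```python
import random
from array import array

def shuffle_records(in_path, out_path, lines_per_record):
    offsets = array("q")
    with open(in_path, "rb") as f:
        pos = 0
        for i, line in enumerate(f):
            # Store one offset per record, i.e. per group of N lines.
            if i % lines_per_record == 0:
                offsets.append(pos)
            pos += len(line)
    random.shuffle(offsets)
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:
            f.seek(off)
            # Copy the whole N-line record starting at this offset.
            for _ in range(lines_per_record):
                out.write(f.readline())

# e.g. lines_per_record=2 for two-line FASTA, 4 for FASTQ
shuffle_records("reads.fastq", "shuffled.fastq", 4)
```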
Alternatively, I have a gist here written in Perl, which will sample sequences without replacement from a FASTA file without regard for the number of lines in a sequence. Note that this isn't exactly the same as shuffling a whole file, but you could use this as a starting point, since it collects the offsets. Instead of sampling some of the offsets, you'd remove line 47 that sorts shuffled indices, then use file seek operations to read through the file, using the shuffled-index list directly.
Again, it won't be fast, because you are jumping through a very large file out of order, but storing offsets is much less expensive than storing whole lines, and adding mmap routines could help a little with what is essentially a series of random access operations. And if you are working with FASTA, you'll have still fewer offsets to store, so your memory usage (excepting any relatively insignificant container and program overhead) should be at most 8 GB — and likely less, depending on its structure.
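As a hypothetical Python illustration of that modified approach (shuffling all header offsets rather than sampling a subset): store one offset per '>' header line, shuffle the offsets, then copy each record up to the next header. File names are placeholders.

```python
import random
from array import array

def shuffle_fasta(in_path, out_path):
    offsets = array("q")
    with open(in_path, "rb") as f:
        pos = 0
        for line in f:
            # Record the offset of each header line, however many
            # sequence lines follow it.
            if line.startswith(b">"):
                offsets.append(pos)
            pos += len(line)
    random.shuffle(offsets)
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:
            f.seek(off)
            out.write(f.readline())            # the header line
            for line in iter(f.readline, b""):
                if line.startswith(b">"):      # next record begins
                    break
                out.write(line)                # sequence data

shuffle_fasta("sequences.fa", "shuffled.fa")
```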