
bash - shuffle a file that is too large to fit in memory

Tags:

bash

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other; I need all of the lines to be shuffled). Are there any options other than rolling my own solution?

asked Nov 26 '16 by George

2 Answers

Using a form of the decorate-sort-undecorate pattern with awk, you can do something like:

$ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
8
5
1
9
6
3
7
2
10
4

For a file, you would do:

$ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT

or cat the file at the start of the pipeline.

This works by prepending a column of random keys between 000000 and 999999 inclusive to each line (decorate), sorting on that column (sort), and then deleting the column (undecorate). Because the keys are zero-padded to a fixed width, a plain lexicographic sort orders them correctly as well, so the approach also works on platforms whose sort does not understand numeric keys.
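To make the decorate step concrete, this is what the intermediate lines look like before the sort (the random keys will of course differ on every run). The key occupies columns 1 through 7, which is why cut -c8- strips it off afterwards:

$ seq 3 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'
383717 1
059292 2
712814 3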

You can increase that randomization, if desired, in a couple of ways:

  1. If your platform's sort understands numeric values (POSIX, GNU and BSD implementations do), you can do awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to use a random key with close to double-precision resolution.

  2. If you are limited to a lexicographic sort, combine two calls to rand into one column like so: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort | cut -f 2-, which gives a composite key with 12 digits of randomness.
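Since the original concern is a file too large for memory, note that the sort step is the only part of the pipeline that has to see the whole data set, and GNU sort performs an external merge sort, spilling to temporary files instead of holding everything in RAM. Assuming GNU coreutils sort (the -S and -T options below), you can cap its memory use and point its temp files at a disk with enough free space, for example:

$ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT \
    | sort -n -S 1G -T /path/to/big/tmpdir \
    | cut -c8- > SHUFFLED.TXT

The 1G buffer size and the temp directory are only illustrative values; pick whatever suits your machine.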

answered Sep 25 '22 by dawg


Count the lines (wc -l) and generate a list of line numbers in a random order, perhaps in a temp file (use /tmp/, which is typically RAM-backed tmpfs and thus relatively fast). Then copy the line corresponding to each number to the target file, in the order of the shuffled numbers.

This would be time-inefficient, because of the amount of seeking for newlines in the file, but it would work on almost any size of file.
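A rough sketch of that approach, assuming GNU coreutils (shuf, sed) and using BIG.TXT / SHUFFLED.TXT as placeholder names. Shuffling just the line numbers with shuf is fine even here, because the number list is tiny compared to the data file:

total=$(wc -l < BIG.TXT)
shuf -i "1-$((total))" > /tmp/order.txt        # random permutation of the line numbers
while read -r n; do
    sed -n "${n}p;${n}q" BIG.TXT               # print line n of the data file, then stop reading
done < /tmp/order.txt > SHUFFLED.TXT

As noted above, this re-reads the data file once per output line, so it is slow, but it never needs to hold more than one data line in memory at a time.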

answered Sep 23 '22 by Leonora Tindall