I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility I can use in its place?
The shuf command generates random permutations of its input lines and writes them to standard output. Given a file or a series of files, it shuffles the lines and writes the result to standard output. It can also limit the number of lines returned, which makes it useful for selecting random lines from a file or random items from a list.
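For illustration, here are a few common invocations (the file name lines.txt is just an example):

    # Shuffle all lines of a file and save the result
    shuf lines.txt > shuffled.txt

    # Pick 5 random lines from the file
    shuf -n 5 lines.txt

    # Shuffle items given directly on the command line
    shuf -e apple banana cherry

    # Print the numbers 1 through 10 in random order
    shuf -i 1-10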
To sort lines of text files in Linux, we use the sort command. The sort command prints the lines of its input, or the concatenation of all the files listed in its argument list, in sorted order. Sorting is done based on one or more sort keys extracted from each line of input.
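As a quick sketch of sort keys (the file names and field layout here are hypothetical):

    # Sort numerically on the second whitespace-separated field
    sort -k2,2n data.txt

    # Concatenate two files and sort the combined lines
    sort file1.txt file2.txt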
The shuf command in Linux writes a random permutation of the input lines to standard output. It pseudo-randomizes the input in much the same way a deck of cards is shuffled.
Use shuf instead of sort -R (man page).
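A minimal sketch of the swap, assuming a file called bigfile.txt:

    # Slow: sort -R hashes a random key for every line
    sort -R bigfile.txt > shuffled.txt

    # Fast: shuf permutes the lines directly
    shuf bigfile.txt > shuffled.txt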
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation, so it doesn't have that problem.
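If you want to see the difference on your own data, a rough comparison (file name hypothetical; exact timings will vary with hardware and file size):

    time sort -R bigfile.txt > /dev/null
    time shuf bigfile.txt > /dev/null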
(This was suggested in a comment but for some reason not written as an answer by anyone)