The -S option is the key: it tells the ls command to sort the file listing by size, largest first. The -h option makes the reported sizes human readable (it only has a visible effect in a long listing such as -l), and -r reverses the sort order, so in this case the largest files are shown at the end of the output.
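For example, combining these with a long listing (the directory here is just an illustration):
ls -lhSr /var/log
This prints a long listing of /var/log with human-readable sizes, smallest files first and the largest at the bottom, where they are easiest to spot.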
The sort command sorts the lines of a file, in numeric or alphabetical order, and prints the result to standard output (usually the terminal screen). The original file is unaffected. Redirected with >, the output of the sort command is instead stored in a file, here named newfilename, in the current directory.
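As a minimal sketch (names.txt and newfilename are placeholder names):
sort names.txt > newfilename
The > redirection is what writes the sorted output into newfilename; without it, the sorted lines simply appear on the terminal. Either way, names.txt itself is left untouched.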
According to Algorithmic details of UNIX Sort command, Unix sort uses an external R-way merge sort. The link goes into more detail, but in essence sort divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
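The same divide, sort, and merge idea can be sketched by hand with coreutils (the file names here are illustrative; the scripts later in this section flesh the idea out properly):
split -l 1000000 big.txt part.
for f in part.*; do sort "$f" > "$f.sorted"; done
sort -m part.*.sorted > big.sorted
split cuts the input into million-line pieces, each piece is sorted independently, and sort -m performs the cheap merge-only pass over the already-sorted pieces.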
The sort command stores working data in temporary disk files (usually in /tmp).
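If /tmp is too small or too slow for those temporary files, GNU sort lets you redirect them elsewhere and raise the in-memory buffer; the directory and size below are only examples:
sort -T /var/tmp -S 256M hugefile > hugefile.sorted
Another way to speed things up on a multi-core machine is to run several sorts at once, which is what the two scripts below do.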
#!/bin/bash
usage ()
{
    echo "Parallel sort"
    echo "usage: psort file1 file2"
    echo "Sorts text file file1 and stores the output in file2"
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit 1
fi
pv "$1" | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > "$2"
WARNING: This script starts one shell per chunk; for really large files, that could be hundreds.
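Note that pv and GNU parallel are usually separate installs rather than part of a base system. Assuming the script is saved as psort and made executable, a run would look like:
./psort big.txt big.sorted
pv's only job here is to show a progress bar while the file is fed into the pipeline; the actual work is done by the parallel sort workers and the final merge-only sort -m pass.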
Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!
#!/bin/ksh
MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
# Glob pattern, kept unquoted below so the shell expands it
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted
usage ()
{
    echo "Parallel sort"
    echo "usage: psort file1 file2"
    echo "Sorts text file file1 and stores the output in file2"
    echo "Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines"
    echo "and each chunk will be sorted in parallel"
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit 1
fi
# Cleanup any leftover files from a previous run
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
rm -f "$SORTED_FILE"
# Split $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK "$ORIGINAL_FILE" "$CHUNK_FILE_PREFIX"
# Sort each chunk in a background job, one job per chunk
for file in $CHUNK_FILE_PREFIX*
do
    sort "$file" > "$file.sorted" &
done
wait
# Merge the sorted chunks into $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > "$SORTED_FILE"
# Cleanup the intermediate files
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
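As a footnote to both scripts: recent GNU coreutils sort (8.6 and later) can use several cores by itself, so on a modern system a plain invocation may achieve much the same effect; the thread count and buffer size below are only examples:
sort --parallel=4 -S 1G huge.txt > huge.sorted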
See also: "Sorting large files faster with a shell script"
I'm not familiar with the program, but I would guess it is done by means of external sorting: most of the data is held in temporary files, while a relatively small part is held in memory at a time. See Donald Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.