Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Disk space required for unix sort

I am currently doing a UNIX sort (via GitBash on a Windows machine) of a 500GB text file. Due to running out of space on the main disk, I have used the -T option to direct the temp files to a disk where I have enough space to accommodate the entire file. The thing is, I've been watching the disk space and apparently the temp files are already in excess of what the original file was. I don't know how much further this is going to go, but I'm wondering if there is a rule by which I can predict how much space I will need for temp files.

like image 839
Stonecraft Avatar asked Aug 10 '16 15:08

Stonecraft


1 Answers

I'd batch it manually as described in this unix.SE answer.

Find some very basic queries that will divide your content into chunks that are small enough to be sorted. For example, if it's a file of words, you could create queries like grep ^a …, grep ^b …, and so on. Some items may need more granularity than others.

You can script that like:

#!/bin/bash
for char1 in other {0..9} {a..z}; do
  out="/tmp/sort.$char1.xz"
  echo "Extracting lines starting with '$char1'"
  if [ "$char1" = "other" ]; then char1='[^a-z0-9]'; fi
  grep -i "^$char1" *.txt |xz -c0 > "$out"
  unxz -c "$out" |sort -u >> output.txt || exit 1
  rm "$out"
done
echo "It worked"

I'm using xz -0 because it's almost as fast as gzip's default gzip -6 yet it's vastly better at conserving space. I omitted it from the final output in order to preserve the exit value of sort -u, but you could instead use a size check (iirc, sort fails with zero output) and then use sort -u |xz -c0 >> output.txt.xz since the xz (and gzip) container lets you concatenate archives (I've written about that before too).

This works because the output of each grep run is already sorted (0 is before 1, which is before a, etc.), so the final assembly doesn't need to run through sort (note, the "other" section will be slightly different since some non-alphanumeric characters are before the numbers, others are between numbers and letters, and others still are after the letters. You can also remove grep's -i flag and additionally iterate through {A..Z} to be case sensitive). Each individual iteration obviously still needs to be sorted, but hopefully they're manageable.

If the program exits before completing all iterations and saying "It worked" then you can edit the script with a more discrete batch for the last iteration it tried. Remove all prior iterations since they're successfully saved in output.txt.

like image 99
Adam Katz Avatar answered Oct 18 '22 11:10

Adam Katz