
Sort a text file by line length including spaces

Answer

cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:

cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

In both cases, we have solved your stated problem by using cut, rather than awk, for the final step of stripping the length prefix; cut leaves the spacing in the rest of the line untouched.

Lines of matching length - what to do in the case of a tie:

The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

(Those who want more control of sorting these ties might look at sort's --key option.)
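As a rough sketch of the two tie-handling choices, the snippet below runs the same pipeline once with -s and once with a second sort key. The three-line sample input and the variable names are made up for illustration, and LC_ALL=C is set only so the text comparison is predictable.

```shell
# Hypothetical sample input: two lines of equal length, then a shorter one.
input=$(printf '%s\n' 'bb' 'aa' 'c')

# -s (stable): lines of equal length keep their input order.
stable=$(printf '%s\n' "$input" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-)

# -k1,1n sorts on the length field only; -k2 then orders ties by the
# text of the line itself.
keyed=$(printf '%s\n' "$input" | awk '{ print length, $0 }' | LC_ALL=C sort -k1,1n -k2 | cut -d" " -f2-)

printf 'stable:\n%s\n' "$stable"
printf 'keyed:\n%s\n' "$keyed"
```

With -s, bb stays ahead of aa because that is how they appear in the input; with the second key, aa comes first.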

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield respectively

hello   awk   world
hello awk world
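The single spaces in the second output come from awk's output field separator, OFS, which defaults to one space. A minimal sketch that makes this visible by setting OFS to a dash (the dash and the variable names are only illustrative choices):

```shell
# Untouched record: $0 is printed exactly as it was read.
kept=$(echo "hello   awk   world" | awk '{ print }')

# Assigning to any field, even to itself, makes awk rebuild $0,
# joining the fields with OFS (a single space by default).
rebuilt=$(echo "hello   awk   world" | awk '{ $1 = $1; print }')

# The same rebuild with OFS set to "-".
dashed=$(echo "hello   awk   world" | awk -v OFS=- '{ $1 = $1; print }')

printf '%s\n' "$kept" "$rebuilt" "$dashed"
```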

The relevant section of the gawk manual mentions only as an aside that awk rebuilds the whole of $0 (joining the fields with the output field separator, OFS) when you change one field. I guess it's not crazy behaviour. The manual has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0

"This forces awk to rebuild the record."

Test input including some lines of equal length:

aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g

The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there. But if you just want to get the job done quickly and don't care which tool you use, one solution is Perl's sort() function with a custom comparison routine applied to the input lines. Here is a one-liner:

perl -e 'print sort { length($a) <=> length($b) } <>'

You can put this in your pipeline wherever you need it. It can read STDIN (from cat or a shell redirect), or you can give the filename to perl as another argument and let it open the file.

In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.


Try this command instead:

awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-

Benchmark results

Below are the results of a benchmark across solutions from other answers to this question.

Test method

  • 10 sequential runs on a fast machine, averaged
  • Perl 5.24
  • awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
  • The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

Results

  1. Caleb's perl solution took 11.2 seconds
  2. my perl solution took 11.6 seconds
  3. neillb's awk solution #1 took 20 seconds
  4. neillb's awk solution #2 took 23 seconds
  5. anubhava's awk solution took 24 seconds
  6. Jonathan's awk solution took 25 seconds
  7. Fritz's bash solution took 400x longer than the awk solutions (using a truncated test case of 100,000 lines). It works fine, it just takes forever.

Another perl solution

perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file