Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What standard commands can I use to print just the first few lines of sorted output on the command line efficiently?

Tags:

linux

bash

unix

I basically want the equivalent of

... | sort -arg1 -arg2 -... | head -n $k

but, my understanding is that sort will go O(n log n) over the whole input. In my case I'm dealing with lots of data, so runtime matters to me - and also I have a habit of overflowing my tmp/ folder with sort temporary files.

I'd rather have it go O(n log k) using e.g. a heap, which would presumably go faster, and which also reduces the working set memory to k as well.

Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive sort power of the sort command. sort (on ubuntu at least) appears to have no man-page-documented switch to pull it off...

like image 664
jdowdell Avatar asked Feb 14 '13 19:02

jdowdell


People also ask

Which command prints first few lines of a file?

Use the head command to write to standard output the first few lines of each of the specified files or of the standard input.

Which command is best to print just the first line from a huge file?

1. The default command which comes to our mind is the head command. head with the option "-1" displays the first line.

How do I print the first line of output in Linux?

Using the head Command The head command is used to display the first lines of a file. By default, the head command will print only the first 10 lines. The head command ships with the coreutils package, which might be already installed on our machine.

What command is used to sort the lines?

SORT command is used to sort a file, arranging the records in a particular order. By default, the sort command sorts file assuming the contents are ASCII. Using options in the sort command can also be used to sort numerically. SORT command sorts the contents of a text file, line by line.


1 Answers

Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.

I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack code helps for anybody doing something similar to what I was doing, here's what I wrote for myself. Not the best python, and a very shady benchmark: I offer it to anybody else who cares to provide more rigorous:

  • 256 files, of about 1.6 Gigs total size, all sitting on an ssd, lines separated by \n, lines of format [^\t]*\t[0-9]+
  • Ubuntu 10.4, 6 cores, 8 gigs of ram, /tmp on ssd as well.
  • $ time sort -t^v<tab> -k2,2n foo* | tail -10000
    • real 7m26.444s
    • user 7m19.790s
    • sys 0m17.530s
  • $ time python test.py 10000 foo*
    • real 1m29.935s
    • user 1m28.640s
    • sys 0m1.220s
  • using diff to analyze, the two methods differ on tie-breaking, but otherwise the sort order is the same.

test.py:

#!/usr/bin/env python
# test.py

from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s,c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
               line_iterator_to_tuple_iterator(chain(*streams)),
               key=lambda x: x[0])

for r in rez:
    print "%s\t%s" % (r[1], r[0])

for s in streams:
    s.close()
like image 178
jdowdell Avatar answered Oct 09 '22 21:10

jdowdell