I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but my understanding is that sort does O(n log n) work over the whole input. In my case I'm dealing with a lot of data, so runtime matters to me, and I also have a habit of overflowing my tmp/ folder with sort's temporary files.
I'd rather have it run in O(n log k) using e.g. a heap, which would presumably be faster and would also shrink the working set to k items.
Is there some combination of standard command-line tools that can do this efficiently, without me having to write something myself? Ideally it would support the full expressive power of sort's key syntax. sort (on Ubuntu, at least) appears to have no man-page-documented switch that pulls this off...
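To make the shape of what I'm after concrete, here's a toy Python sketch of the bounded-heap idea (purely illustrative: it assumes the sort key is a numeric second tab-separated field, and a real tool would obviously need to take the key spec as an argument):

#!/usr/bin/env python
# toy sketch: stream lines, keeping only the k largest keys in a min-heap of size k
import sys
import heapq

k = int(sys.argv[1])
heap = []  # min-heap of at most k (key, line) pairs; the smallest kept key sits at heap[0]
for line in sys.stdin:
    key = int(line.split("\t")[1])  # assumed key: numeric second tab-separated field
    if len(heap) < k:
        heapq.heappush(heap, (key, line))
    elif key > heap[0][0]:
        # evict the current minimum and insert the new item in one O(log k) step
        heapq.heappushpop(heap, (key, line))
for key, line in sorted(heap):  # emit the survivors in ascending key order, like sort | tail
    sys.stdout.write(line)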
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar, here's what I wrote for myself. It's not the best Python, and the benchmark below is very shaky; I offer it to anybody who cares to do something more rigorous:
$ time sort -t^v<tab> -k2,2n foo* | tail -10000    # ^v<tab> = a literal tab, typed as Ctrl-V then Tab
$ time python test.py 10000 foo*
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap - and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s, c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
                     line_iterator_to_tuple_iterator(chain(*streams)),
                     key=lambda x: x[0])

for r in rez:
    print("%s\t%s" % (r[1], r[0]))

for s in streams:
    s.close()
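Because heapq.nlargest never holds more than N tuples at a time, the working set stays at N and nothing spills to tmp/ the way sort's merge files do. If you'd rather have something that drops into the middle of an existing pipeline the way sort does, a stdin-reading variant along these lines should work (same assumed string<TAB>count format; the name topk_stdin.py is just mine):

#!/usr/bin/env python
# topk_stdin.py - hypothetical pipeline-friendly variant of test.py:
# reads records from standard input instead of from named files
import sys
import heapq

N = int(sys.argv[1])
# same assumed record format as test.py: "string<TAB>integer count"
best = heapq.nlargest(N, sys.stdin, key=lambda line: int(line.split("\t")[1]))
for line in best:
    sys.stdout.write(line)

With that, the benchmark above becomes something like: cat foo* | python topk_stdin.py 10000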