I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but my understanding is that sort does O(n log n) work over the whole input. In my case I'm dealing with a lot of data, so runtime matters to me, and I also have a habit of overflowing my tmp/ folder with sort's temporary files.
I'd rather have it run in O(n log k) using e.g. a heap, which would presumably be faster and would also shrink the working set to k items.
Is there some combination of standard command-line tools that can do this efficiently, without me having to write something myself? Ideally it would support the full expressive power of sort's key syntax. sort (on Ubuntu, at least) appears to have no man-page-documented switch that pulls this off...
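To make the shape of what I'm after concrete, here's a toy Python sketch of the bounded-heap idea (purely illustrative: it assumes the sort key is a numeric second tab-separated field, and a real tool would obviously need to take the key spec as an argument):

#!/usr/bin/env python
# toy sketch: stream lines, keeping only the k largest keys in a min-heap of size k
import sys
import heapq

k = int(sys.argv[1])
heap = []  # min-heap of at most k (key, line) pairs; the smallest kept key sits at heap[0]
for line in sys.stdin:
    key = int(line.split("\t")[1])  # assumed key: numeric second tab-separated field
    if len(heap) < k:
        heapq.heappush(heap, (key, line))
    elif key > heap[0][0]:
        # evict the current minimum and insert the new item in one O(log k) step
        heapq.heappushpop(heap, (key, line))
for key, line in sorted(heap):  # emit the survivors in ascending key order, like sort | tail
    sys.stdout.write(line)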
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar, here's what I wrote for myself. It's not the best Python, and the benchmark below is very shaky; I offer it to anybody who cares to do something more rigorous:
$ time sort -t^v<tab> -k2,2n foo* | tail -10000    # ^v<tab> = a literal tab, typed as Ctrl-V then Tab
$ time python test.py 10000 foo*
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap - and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s, c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
                     line_iterator_to_tuple_iterator(chain(*streams)),
                     key=lambda x: x[0])

for r in rez:
    print("%s\t%s" % (r[1], r[0]))

for s in streams:
    s.close()
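Because heapq.nlargest never holds more than N tuples at a time, the working set stays at N and nothing spills to tmp/ the way sort's merge files do. If you'd rather have something that drops into the middle of an existing pipeline the way sort does, a stdin-reading variant along these lines should work (same assumed string<TAB>count format; the name topk_stdin.py is just mine):

#!/usr/bin/env python
# topk_stdin.py - hypothetical pipeline-friendly variant of test.py:
# reads records from standard input instead of from named files
import sys
import heapq

N = int(sys.argv[1])
# same assumed record format as test.py: "string<TAB>integer count"
best = heapq.nlargest(N, sys.stdin, key=lambda line: int(line.split("\t")[1]))
for line in best:
    sys.stdout.write(line)

With that, the benchmark above becomes something like: cat foo* | python topk_stdin.py 10000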