
gnu sort - default buffer size

Tags:

gnu-sort

I have read the full documentation for GNU sort and searched online but I cannot find what the default for the --buffer-size option is (which determines how much system memory the program uses when it runs). I am guessing it is somehow determined based on total system memory? (Or perhaps on memory available at the time the program begins execution?) How can I determine this?

Update: I've experimented a bit, and it seems that when I don't specify a particular --buffer-size value, it ends up using very little RAM and thus going very slowly. It would be nice, though, to better understand what exactly determines this behavior.

Asked May 29 '16 19:05 by Michael Ohlrogge

2 Answers

I went digging through the coreutils sort source code and found these functions: default_sort_size and sort_buffer_size.

It turns out that --buffer-size (sort_size in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size value is specified, the default_sort_size function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:

size = MIN(SIZE_MAX, resource_limit) / 2;
mem  = MAX(available_memory, total_memory / 8);

if ( size > total_memory * 0.75 )
    size = total_memory * 0.75;

buffer_max = MIN(mem, size);
buffer_max = MAX(buffer_max, MIN_SORT_SIZE);
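To see how the bounds interact, here is a hypothetical Python rendering of that summary. The function name mirrors the coreutils source, but the inputs are passed in explicitly and the MIN_SORT_SIZE constant is illustrative, not the real build-time value:

```python
SIZE_MAX = 2**64 - 1            # address-space cap on a 64-bit system
MIN_SORT_SIZE = 16 * 1024       # illustrative floor, not the real constant

def default_sort_size(resource_limit, available_memory, total_memory):
    # Start from the smaller of the address-space cap and any rusage
    # limit, then halve it to leave headroom for the rest of the process.
    size = min(SIZE_MAX, resource_limit) // 2
    # Use free memory, but never assume less than 1/8 of total memory.
    mem = max(available_memory, total_memory // 8)
    # Never let the buffer exceed 3/4 of physical memory.
    size = min(size, total_memory * 3 // 4)
    # The final maximum is the smaller of the two, floored at a minimum.
    return max(min(mem, size), MIN_SORT_SIZE)
```

For example, with a 1 TiB rusage limit, 8 GiB free, and 16 GiB total, the cap works out to the 8 GiB of free memory, since that is below both the halved rusage limit and the 12 GiB (3/4 of total) ceiling.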

The other function, sort_buffer_size, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:

if (sort_size is set)
    size_bound = sort_size;
else
    size_bound = default_sort_size();

size = line_bytes + 2;

for each input_file
    if (input_file is regular)
        file_size = input_file_size;
    else
        if (sort_size is set)
            return sort_size;
        else
            file_size = guess;

    worst_case = file_size * worst_case_per_input_byte + 1;

    if (worst_case overflows || size + worst_case >= size_bound)
        return size_bound;
    else
        size += worst_case;

return size;
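The same summary can be made runnable. In this hypothetical Python sketch, file sizes are passed in directly, None marks a non-regular input such as a pipe, and the constants are illustrative stand-ins for the real coreutils values:

```python
WORST_CASE_PER_INPUT_BYTE = 2   # illustrative overhead multiplier
LINE_BYTES = 64                 # illustrative per-line bookkeeping cost
PIPE_SIZE_GUESS = 128 * 1024    # illustrative guess for unsized inputs

def sort_buffer_size(file_sizes, sort_size=None, default_bound=2**33):
    # --buffer-size, if given, is the hard ceiling; otherwise fall back
    # to the computed default maximum.
    size_bound = sort_size if sort_size is not None else default_bound
    size = LINE_BYTES + 2
    for file_size in file_sizes:
        if file_size is None:           # not a regular file (e.g. a pipe)
            if sort_size is not None:
                return sort_size        # trust the user-supplied cap
            file_size = PIPE_SIZE_GUESS # otherwise take a rough guess
        worst_case = file_size * WORST_CASE_PER_INPUT_BYTE + 1
        # Cap the running total at the bound instead of overshooting it.
        if size + worst_case >= size_bound:
            return size_bound
        size += worst_case
    return size
```

Note how a single pipe input with an explicit sort_size short-circuits the whole calculation, while regular files accumulate a worst-case estimate capped at the bound.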

Possibly the most important point about the sort_buffer_size function is that if you're sorting data from STDIN or a pipe, it will immediately return sort_size (i.e. the --buffer-size value) if one was provided. Otherwise, for regular files it makes some rough calculations based on the file sizes and only uses sort_size as an upper limit.
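You can see this in practice by feeding sort from a pipe with an explicit cap. A quick demonstration via Python's subprocess module (this assumes GNU sort is on PATH; the 64M value is just an example):

```python
import subprocess

data = b"banana\napple\ncherry\n"
# Input arrives on a pipe, so sort cannot stat a file size; the explicit
# -S 64M sets the buffer ceiling directly instead of the small default.
result = subprocess.run(["sort", "-S", "64M"], input=data,
                        capture_output=True)
print(result.stdout.decode(), end="")
```

Without -S, the same pipeline falls back to the small built-in guess, which is what makes large piped sorts slow by default.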

Answered Nov 15 '22 17:11 by Mr. Llama

To summarize in English, the defaults are:

Reading from a real file: Use all free memory, up to 3/4 and not less than 1/8 of total memory.

(If there is a process (rusage) memory limit in effect, sort will not use more than half of that.)

Reading from a pipe: Use a small, fixed amount (tens of MB).
You will probably want -S.

Current for GNU coreutils 8.29, Jan 2018.

Answered Nov 15 '22 17:11 by Doctor J