I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ gigs. Each line in each file contains a 40-byte string. The strings are fairly well ordered already (roughly 1 in 10 steps is a decrease in value instead of an increase).
I would like the lines ordered across all files (in place, if possible?). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt, and vice versa.
I am working on Linux and fairly new to it. I know about the sort
command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file from smaller files that Linux will treat as a single file.
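A side note on the pseudo-file idea: sort already accepts several input files and treats them as one concatenated stream, so no 10+ GB joined file has to be created just to run a single sort. A minimal sketch with small stand-in files (the name combined_sorted.txt is chosen for the demo):

```shell
# Stand-ins for the real, much larger inputs.
printf 'DDD\nXXX\nAAA\n' > file0.txt
printf 'BBB\nFFF\nCCC\n' > file1.txt

# sort concatenates all of its file operands and sorts the combined
# stream; -o writes the result without any joined intermediate input.
sort file0.txt file1.txt -o combined_sorted.txt
```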
What I know I can do: I can sort each file individually, then read into file1.txt to find the values larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join and sort again. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (however highly unlikely that is in my case).
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know of a command that does in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to a check-only run: sort -c "$file" || exit 1.)

sort -m does an efficient merge of the pre-sorted input files, keeping the output sorted, and pipes the merged stream into the split command, which writes it to suffixed output files (output00, output01, ...). Note the - character: it tells split to read from standard input (i.e. the pipe) instead of a file.

Also, here's a short summary of how the merge works: sort reads a line from each file, writes out the smallest of those lines, then reads the next line from the file that smallest line came from, repeating until every input is consumed.
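If you want the sorted data to end up back under the original file names (the "in-place" wish from the question), you can split the merged stream into pieces of the original sizes and move them over the originals. A minimal sketch using the question's example data, assuming every file has the same number of lines (3 here; substitute your real per-file line count):

```shell
# Demo data matching the question's example.
printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# Sort each file individually, then merge and split the stream
# back into 3-line pieces named sorted_00, sorted_01, ...
for file in f?.txt; do sort -o "$file" "$file"; done
sort -m f?.txt | split -d -l 3 - sorted_

# Move each piece over the corresponding original file.
n=0
for file in f?.txt; do
    mv "sorted_$(printf '%02d' "$n")" "$file"
    n=$((n + 1))
done
```

If the files differ in length, a single split -l cannot reproduce the original boundaries; you would instead cut the merged stream at the recorded per-file line counts (e.g. with head/tail).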