Sort across multiple files in Linux

Tags: file, linux, sorting

I have multiple (many) files; each very large:

file0.txt
file1.txt
file2.txt

I do not want to join them into a single file because the resulting file would be 10+ GB. Each line in each file contains a 40-byte string. The strings are already fairly well ordered: only about 1 in 10 steps is a decrease in value instead of an increase.

I would like the lines ordered (in place, if possible). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.

I am working on Linux and am fairly new to it. I know about the sort command for a single file, but I am wondering if there is a way to sort across multiple files. Or maybe there is a way to build a pseudo-file from the smaller files that Linux will treat as a single file.

What I know I can do: sort each file individually, read into file1.txt to find the first value larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join them, and then sort. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (although that is highly unlikely in my case).

Edit

To be clear, if the files look like this:

f0.txt
DDD
XXX
AAA

f1.txt
BBB
FFF
CCC

f2.txt
EEE
YYY
ZZZ

I want this:

f0.txt
AAA
BBB
CCC

f1.txt
DDD
EEE
FFF

f2.txt
XXX
YYY
ZZZ

asked Oct 07 '11 22:10

Paul

1 Answer

I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:

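# Sort each input file individually, in place.
for file in *.txt; do
    sort -o "$file" "$file"
done
# Merge the pre-sorted files and split the stream into 1,000,000-line chunks
# named output00, output01, ...
sort -m *.txt | split -d -l 1000000 - output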

The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c "$file" || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
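As a small worked example, the pipeline can be tried on the three sample files from the question. This is only a sketch: the scratch directory, the 3-line chunk size, and the LC_ALL=C setting (plain byte-order comparison) are assumptions made for the demo, and split -d assumes GNU coreutils.

```shell
#!/bin/sh
# Demo sketch: reproduce the f0/f1/f2 example in a scratch directory.
export LC_ALL=C                 # assumption: compare lines byte by byte

workdir=$(mktemp -d)            # throwaway directory, keeps real files untouched
cd "$workdir" || exit 1

printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# Step 1: sort every file individually, in place.
for file in f?.txt; do
    sort -o "$file" "$file"
done

# Step 2: merge the sorted files and cut the stream into 3-line pieces.
sort -m f?.txt | split -d -l 3 - output

# Show the pieces: output00, output01, output02.
for piece in output*; do
    echo "== $piece"
    cat "$piece"
done
```

Here output00 ends up holding AAA, BBB, CCC, output01 holds DDD, EEE, FFF, and output02 holds XXX, YYY, ZZZ. The pieces could then be renamed over the original files once the result has been checked.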

Also, here's a short summary of how the merge sort works:

1. sort reads a line from each file.
2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
3. Repeat step 2 until there are no more lines in any file.
4. At this point, the output should be a perfectly sorted file.
5. Profit!
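For illustration only, the steps above can be sketched as a tiny shell function that drives awk. merge_sorted is a hypothetical helper name, not part of coreutils, and this is not how sort -m is actually implemented; in practice sort -m is the tool to use. The sketch assumes every input file is already sorted and that plain byte-order comparison is wanted.

```shell
#!/bin/sh
# merge_sorted FILE...  (hypothetical helper, for illustration only)
# Merges already-sorted files line by line, following the steps listed above.
merge_sorted() {
    LC_ALL=C awk '
    BEGIN {
        # Step 1: read the first line of every input file.
        for (i = 1; i < ARGC; i++)
            alive[i] = ((getline line[i] < ARGV[i]) > 0)
        while (1) {
            # Step 2: select the pending line that should come first.
            # Appending "" forces a string comparison, never a numeric one.
            best = 0
            for (i = 1; i < ARGC; i++)
                if (alive[i] && (best == 0 || (line[i] "") < (line[best] "")))
                    best = i
            if (best == 0)
                break               # Step 3 ends: no file has lines left.
            print line[best]
            # Read a replacement line from the file that supplied the winner.
            alive[best] = ((getline line[best] < ARGV[best]) > 0)
        }
    }' "$@"
}
```

Called as merge_sorted f0.txt f1.txt f2.txt on sorted inputs, it writes the merged lines to standard output, so it could be piped into split in the same way as sort -m.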

answered Nov 07 '22 19:11

JBert
