Here is the problem I'm facing:
The file read/write time itself takes hours, so I would like to find a way to improve the following:
cat file1 file2 file3 ... fileN >> newBigFile
This requires double the disk space: file1 ... fileN takes up 100 GB, then newBigFile takes another 100 GB, and only then does file1 ... fileN get removed.
The data is already in file1 ... fileN; doing the cat >> incurs read and write time, when all I really need is for the hundreds of files to reappear as one file...
Type the cat command followed by the file or files you want to add to the end of an existing file. Then, type two output redirection symbols ( >> ) followed by the name of the existing file you want to add to.
The cat command is a very popular and versatile command in the 'nix ecosystem. It has four common uses: displaying a file, concatenating (combining) multiple files, echoing text, and creating a new file.
You can use the * character to match all the files in your current directory. cat * will display the content of all the files.
To append the merged content of multiple files to another file in Linux, use the double redirection operator (>>) along with the cat command. Rather than overwriting the contents of the target file, this appends the content at its end.
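As a small illustration (the file names here are placeholders, not from the original post):

$ cat file2 file3 >> file1    # file1 grows; file2 and file3 are left unchanged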
If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do
$ consume big-file.txt
instead do
$ consume <(cat file1 file2 ... fileN)
This uses Unix process substitution, sometimes also called "anonymous named pipes."
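If your shell lacks the <( ) syntax, a similar effect can be had with an explicit named pipe (FIFO). This is only a sketch; merged.pipe is a hypothetical name and consume stands in for whatever program reads the data:

$ mkfifo merged.pipe                      # create the named pipe
$ cat file1 file2 ... fileN > merged.pipe &   # writer runs in the background
$ consume merged.pipe                     # reader sees one continuous stream
$ rm merged.pipe                          # the pipe itself holds no data

The data flows through the pipe as it is read, so no 100 GB copy is ever written to disk.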
You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that will do precisely this. It can also reassemble the outputs back into one big file, potentially using less scratch space as it only needs to keep number-of-cores pieces on disk at once. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
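A rough sketch of that approach, assuming your processing step reads stdin and writes stdout (the block size and the filter name yourprocess are placeholders):

$ cat file1 file2 ... fileN | parallel --pipe --block 100M -k yourprocess > result.txt

Here --pipe splits stdin into blocks, --block sets the block size, and -k keeps the outputs in input order when they are stitched back together.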
When concatenating files back together, you could delete the small files as they get appended:
for file in file1 file2 file3 ... fileN; do cat "$file" >> bigFile && rm "$file"; done
This would avoid needing double the space.
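If the files follow a common naming pattern, the same loop can be driven by a glob instead of an explicit list (the pattern below is just an example):

for file in file*; do
    # && ensures a file is only removed after it was appended successfully
    # note: glob order is lexicographic, so file10 sorts before file2
    cat "$file" >> bigFile && rm "$file"
done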
There is no other way to magically make the files concatenate; the filesystem API simply doesn't have a function that does that.