
How can I cat multiple files together into one without intermediary file? [closed]

Here is the problem I'm facing:

  • I am string processing a text file ~100 GB in size.
  • I'm trying to improve the runtime by splitting the file into many hundreds of smaller files and processing them in parallel.
  • In the end I cat the resulting files back together in order.

The file read/write time itself takes hours, so I would like to find a way to improve the following:

cat file1 file2 file3 ... fileN >> newBigFile 
  1. This requires double the disk space: file1 ... fileN take up 100 GB, newBigFile takes another 100 GB, and only then do file1 ... fileN get removed.

  2. The data is already in file1 ... fileN; doing the cat >> incurs read and write time when all I really need is for the hundreds of files to reappear as one file...

asked Nov 01 '10 by Wing

People also ask

How do I put multiple files into one cat?

Type the cat command followed by the file or files you want to append. Then type two output redirection symbols ( >> ) followed by the name of the existing file you want to add them to.
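As a minimal sketch (the file names here are hypothetical):

```shell
# Create some sample files
printf 'one\n' > part1.txt
printf 'two\n' > part2.txt
printf 'base\n' > combined.txt

# Append part1.txt and part2.txt to the end of combined.txt
cat part1.txt part2.txt >> combined.txt

# combined.txt now contains: base, one, two
cat combined.txt
```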

Can you cat multiple files at once?

The cat command is a very popular and versatile command in the *nix ecosystem. There are four common uses of the cat command: it can display a file, concatenate (combine) multiple files, echo text, and create a new file.

How do I cat all files in a folder?

You can use the * character to match all the files in your current directory. cat * will display the content of all the files.

How do I combine multiple files into one in Linux?

To append the merged content of multiple files to another file in Linux, use the double redirection operator (>>) along with the cat command. Rather than overwriting the contents of the target file, this appends the content at the end of it.


2 Answers

If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do

$ consume big-file.txt 

instead do

$ consume <(cat file1 file2 ... fileN) 

This uses Unix process substitution, sometimes also called "anonymous named pipes."
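A minimal sketch of what this looks like (requires bash; the file names are hypothetical, and `wc -l` stands in for the real consume program):

```shell
# Requires bash: <( ... ) is process substitution, not POSIX sh.
printf 'a\n' > f1.txt
printf 'b\n' > f2.txt

# The consumer sees the two files streamed as one, with no
# temporary concatenated file ever written to disk.
wc -l < <(cat f1.txt f2.txt)
```

Behind the scenes, bash connects `cat`'s output to the consumer through a pipe, so no extra 100 GB of scratch space is needed.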

You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that will do precisely this. It can also reassemble the outputs back into one big file, potentially using less scratch space as it only needs to keep number-of-cores pieces on disk at once. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
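As a sketch of the --pipe approach (assuming GNU Parallel is installed; `tr` stands in here for your real per-chunk processing step):

```shell
# Split stdin into ~10 MB chunks, process each chunk in parallel,
# and reassemble the results in the original order (--keep-order).
parallel --pipe --block 10M --keep-order tr 'a-z' 'A-Z' \
  < bigFile > newBigFile
```

With --keep-order the output chunks are written out in input order, so the result matches what a sequential run would have produced.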

answered Sep 20 '22 by Jay Hacker

When concatenating files back together, you could delete the small files as they get appended:

for file in file1 file2 file3 ... fileN; do
  cat "$file" >> bigFile && rm "$file"
done

This would avoid needing double the space.

There is no way to magically make files concatenate in place. The filesystem API simply doesn't have a function that does that.

answered Sep 23 '22 by Robie Basak