Here is the problem I'm facing:
The file read/write time itself takes hours, so I would like to find a way to improve the following:
cat file1 file2 file3 ... fileN >> newBigFile
This requires double the disk space: file1 ... fileN takes up 100 GB, then newBigFile takes another 100 GB, and only then does file1 ... fileN get removed.
The data is already in file1 ... fileN; doing the cat >> incurs read and write time, when all I really need is for the hundreds of files to reappear as one file...
Type the cat command followed by the file or files you want to add to the end of an existing file. Then, type two output redirection symbols ( >> ) followed by the name of the existing file you want to add to.
The cat command is a very popular and versatile command in the 'nix ecosystem. It has four common uses: displaying a file, concatenating (combining) multiple files, echoing text, and creating a new file.
You can use the * character to match all the files in your current directory. cat * will display the content of all the files.
To append the merged content of multiple files to another file in Linux, use the double redirection operator (>>) along with the cat command. Rather than overwriting the contents of the target file, this appends the content at its end.
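As a small illustration (the file names here are placeholders, not from the original post):

$ cat file2 file3 >> file1    # file1 grows; file2 and file3 are left unchanged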
If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do
$ consume big-file.txt
instead do
$ consume <(cat file1 file2 ... fileN)
This uses Unix process substitution, sometimes also called "anonymous named pipes."
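If your shell lacks the <( ) syntax, a similar effect can be had with an explicit named pipe (FIFO). This is only a sketch; merged.pipe is a hypothetical name and consume stands in for whatever program reads the data:

$ mkfifo merged.pipe                      # create the named pipe
$ cat file1 file2 ... fileN > merged.pipe &   # writer runs in the background
$ consume merged.pipe                     # reader sees one continuous stream
$ rm merged.pipe                          # the pipe itself holds no data

The data flows through the pipe as it is read, so no 100 GB copy is ever written to disk.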
You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that will do precisely this. It can also reassemble the outputs back into one big file, potentially using less scratch space as it only needs to keep number-of-cores pieces on disk at once. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
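A rough sketch of that approach, assuming your processing step reads stdin and writes stdout (the block size and the filter name yourprocess are placeholders):

$ cat file1 file2 ... fileN | parallel --pipe --block 100M -k yourprocess > result.txt

Here --pipe splits stdin into blocks, --block sets the block size, and -k keeps the outputs in input order when they are stitched back together.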
When concatenating files back together, you could delete the small files as they get appended:
for file in file1 file2 file3 ... fileN; do cat "$file" >> bigFile && rm "$file"; done
This would avoid needing double the space.
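If the files follow a common naming pattern, the same loop can be driven by a glob instead of an explicit list (the pattern below is just an example):

for file in file*; do
    # && ensures a file is only removed after it was appended successfully
    # note: glob order is lexicographic, so file10 sorts before file2
    cat "$file" >> bigFile && rm "$file"
done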
There is no other way to magically make the files concatenate; the filesystem API simply doesn't have a function that does that.