Fast concatenate multiple files on Linux

I am using Python multiprocessing to generate one temporary output file per process. Each can be several GB in size, and I make several tens of them. These temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata, without actually copying the content? Anything that works on any Linux system would be acceptable to me, but a file-system-specific solution won't be of much help.

I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to concatenate, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities, I fully expected one to exist, but could not find anything; hence my question on SO. The file system is on a block device (a hard disk, actually), in case this matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) will be very helpful.

san asked May 05 '11 06:05


2 Answers

Even if there were such a tool, it could only work if every file except the last were guaranteed to have a size that is an exact multiple of the filesystem's block size.

If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:

  1. Before starting the multiprocessing, create the final output file and grow it to the final size, either by calling ftruncate() or by fseek()ing to the last byte and writing a single byte; this creates a sparse file.

  2. Start multiprocessing, handing each process the FD and the offset into its particular slice of the file.

This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.
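The two steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names `write_slice` and `fill_preallocated` are made up, the in-memory `payloads` stand in for whatever each worker really generates, every slice size is known up front, and the "fork" start method is used so that children inherit the open FD (which is the case on Linux).

```python
import multiprocessing as mp
import os


def write_slice(fd, offset, payload):
    """Worker: write payload into the shared file starting at offset."""
    view = memoryview(payload)
    while len(view) > 0:
        # pwrite takes an explicit offset, so workers never race on a
        # shared file position. It may write less than asked, so loop.
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]


def fill_preallocated(path, payloads):
    """Preallocate path, then let one process per payload fill its slice."""
    ctx = mp.get_context("fork")  # assumption: Linux; children inherit fd
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Grow the file to its final size; typically sparse until written.
        os.ftruncate(fd, sum(len(p) for p in payloads))
        workers, offset = [], 0
        for payload in payloads:
            workers.append(
                ctx.Process(target=write_slice, args=(fd, offset, payload))
            )
            offset += len(payload)
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        if any(worker.exitcode != 0 for worker in workers):
            raise RuntimeError("a worker failed to write its slice")
    finally:
        os.close(fd)
```

Calling `fill_preallocated("out.bin", [b"aaa", b"bbbb", b"cc"])` should leave a 9-byte file holding the three slices in order. Using `os.pwrite` rather than `seek` plus `write` matters here, because processes that share one FD also share its file position.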

EDIT

If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed the output of cat tmpfile1 ... tmpfileN to the consumer, either on stdin

cat tmpfile1 ... tmpfileN | consumer

or as a file-name argument, using bash's process substitution (which is implemented with /dev/fd or named pipes):

consumer <(cat tmpfile1 ... tmpfileN)
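If the pipeline is driven from Python rather than from a shell, the same sequential feed might look like the sketch below. The function name `stream_to_consumer` and the 1 MiB buffer size are illustrative choices, not part of the answer; the consumer is assumed to read its input from stdin.

```python
import shutil
import subprocess


def stream_to_consumer(paths, consumer_argv):
    """Feed the files, in order, to the consumer's stdin; return its exit code."""
    proc = subprocess.Popen(consumer_argv, stdin=subprocess.PIPE)
    try:
        for path in paths:
            with open(path, "rb") as src:
                # Copy in 1 MiB pieces so a multi-GB file is never held in memory.
                shutil.copyfileobj(src, proc.stdin, 1024 * 1024)
    finally:
        proc.stdin.close()  # closing stdin signals end of input to the consumer
    return proc.wait()
```

The data is still copied once through the pipe, but no concatenated file is ever written to disk, and the consumer can start working as soon as the first bytes arrive.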
Marc Mutz - mmutz answered Sep 20 '22 14:09


You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.

In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.

FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), this requires surprisingly little code.
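The FUSE glue itself depends on which binding is chosen, so it is not shown here. The part that is independent of the binding is the offset arithmetic a read handler would need: mapping an offset in the virtual file onto the right chunk. A minimal sketch of that logic, with a hypothetical class name `ChunkedView` and assuming the chunk files are complete and no longer changing, could be:

```python
import bisect
import os


class ChunkedView:
    """Read-only view presenting several files as one contiguous byte range."""

    def __init__(self, paths):
        self.paths = list(paths)
        # starts[i] is the virtual offset at which chunk i begins.
        self.starts = []
        total = 0
        for path in self.paths:
            self.starts.append(total)
            total += os.path.getsize(path)
        self.size = total

    def read(self, offset, length):
        """Return up to length bytes starting at the given virtual offset."""
        end = min(offset + length, self.size)
        out = bytearray()
        # Locate the last chunk whose start is <= offset.
        index = bisect.bisect_right(self.starts, offset) - 1
        while offset < end and index < len(self.paths):
            with open(self.paths[index], "rb") as chunk:
                chunk.seek(offset - self.starts[index])
                data = chunk.read(end - offset)
            out += data
            offset += len(data)
            index += 1  # continue in the next chunk if the read spans a boundary
        return bytes(out)
```

A FUSE read callback would then mostly delegate to `read(offset, length)`, and the getattr callback would report `size` as the file length. Reopening the chunk on every call keeps the sketch short; a real implementation would probably cache open file handles.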

NPE answered Sep 16 '22 14:09