I am using Python multiprocessing to generate a temporary output file per process. These files can be several GB in size, and I make several tens of them. The temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata rather than actually copying the content? As long as it works on any Linux system, that would be acceptable to me, but a file-system-specific solution won't be of much help.
I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to merge, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities, I fully expected there to be one, but could not find anything; hence my question on SO. The file system is on a block device (a hard disk, actually), in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) would be very helpful.
Even if there were such a tool, it could only work if all files except the last were guaranteed to have a size that is a multiple of the filesystem's block size.
If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:

1. Before starting the multiprocessing, create the final output file, and grow it to its final size by fseek()ing to the end; this will create a sparse file.

2. Start the multiprocessing, handing each process the FD and the offset into its particular slice of the file.
This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.
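A minimal sketch of that approach, assuming the size of each worker's slice is known up front (the chunk size, worker count and file name below are made up for illustration; the sketch also reopens the output file by name in each worker rather than inheriting the FD, which sidesteps FD-sharing details):

    from multiprocessing import Process

    CHUNK_SIZE = 100 * 1024 * 1024      # assumed, known size of each slice
    NUM_WORKERS = 4
    OUTPUT = "final.out"                # hypothetical output file name

    def worker(path, offset, size):
        # Each process opens the same file and writes only within its own slice.
        with open(path, "r+b") as f:
            f.seek(offset)
            # ... replace the dummy bytes with the real data for this slice ...
            f.write(b"\0" * size)

    if __name__ == "__main__":
        # Create the output file and grow it to its final size; on typical Linux
        # filesystems this gives a sparse file, so no data blocks are written yet.
        with open(OUTPUT, "wb") as f:
            f.truncate(CHUNK_SIZE * NUM_WORKERS)

        procs = [Process(target=worker, args=(OUTPUT, i * CHUNK_SIZE, CHUNK_SIZE))
                 for i in range(NUM_WORKERS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()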
EDIT
If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed cat tmpfile1 ... tmpfileN to the consumer, either on stdin:

    cat tmpfile1 ... tmpfileN | consumer

or via named pipes (using bash's Process Substitution):

    consumer <(cat tmpfile1 ... tmpfileN)
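If the consumer is started from the Python driver anyway, the same idea can be expressed without the shell. A sketch, assuming a hypothetical consumer command and list of temporary file names:

    import shutil
    import subprocess

    tmpfiles = ["tmpfile1", "tmpfile2", "tmpfile3"]   # hypothetical names

    # Stream each temporary file into the consumer's stdin, in order,
    # without ever materialising the concatenated file on disk.
    proc = subprocess.Popen(["consumer"], stdin=subprocess.PIPE)
    for name in tmpfiles:
        with open(name, "rb") as f:
            shutil.copyfileobj(f, proc.stdin)
    proc.stdin.close()
    proc.wait()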
You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.
In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.
FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), this requires surprisingly little code.
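A rough sketch of such a read-only overlay using the fusepy bindings (pip install fusepy); the chunk list, mount point and virtual file name "/output" are invented for illustration:

    import errno
    import os
    import stat
    from fuse import FUSE, FuseOSError, Operations

    class ConcatFS(Operations):
        """Read-only FS exposing the chunk files as one virtual file."""

        def __init__(self, chunks):
            self.chunks = chunks                               # paths, in output order
            self.sizes = [os.path.getsize(c) for c in chunks]
            self.total = sum(self.sizes)

        def getattr(self, path, fh=None):
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
            if path == "/output":
                return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                            st_size=self.total)
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", "..", "output"]

        def read(self, path, size, offset, fh):
            # Map the requested byte range onto the underlying chunk files.
            data = b""
            pos = 0
            for chunk, csize in zip(self.chunks, self.sizes):
                if offset < pos + csize and len(data) < size:
                    with open(chunk, "rb") as f:
                        f.seek(max(0, offset - pos))
                        data += f.read(size - len(data))
                pos += csize
            return data

    if __name__ == "__main__":
        chunks = ["tmpfile1", "tmpfile2", "tmpfile3"]   # hypothetical chunk files
        FUSE(ConcatFS(chunks), "/mnt/concat", foreground=True, ro=True)

The consumer would then simply read /mnt/concat/output, while the data stays in the individual chunk files underneath.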