Fast concatenate multiple files on Linux

I am using Python multiprocessing to generate one temporary output file per process. Each file can be several GB in size, and I create several tens of them. These temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata instead of actually copying the content? Anything that works on any Linux system would be acceptable to me, but a solution specific to one file system won't be of much help.

I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to copy from, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities I fully expected one to exist, but I could not find anything, hence my question on SO. The file system is on a block device (a hard disk, in fact), in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) would be very helpful.

754

asked May 05 '11 06:05

san

2 Answers

Even if there were such a tool, it could only work if every file except the last were guaranteed to have a size that is a multiple of the filesystem's block size.

If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:

Before starting the multiprocessing, create the final output file and grow it to its final size, either with ftruncate() or by fseek()ing to the last byte and writing to it. Seeking alone does not extend the file; the write (or the truncate call) does, and the result is a sparse file.
Start the multiprocessing, handing each process the FD and the offset of its particular slice of the file.

This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.
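The steps above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (plan_offsets, preallocate, write_slice, build_output) are hypothetical, the payloads are in-memory stand-ins for data that each worker would really generate itself, and each worker reopens the output file by path rather than receiving an FD, which is a simplification of the scheme described above.

```python
import os
from multiprocessing import Pool


def plan_offsets(sizes):
    """Return the starting offset of each slice, given slice sizes in order."""
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


def preallocate(path, total_size):
    """Create the output file and extend it to its final size (sparse on most Linux filesystems)."""
    with open(path, "wb") as f:
        f.truncate(total_size)


def write_slice(path, offset, data):
    """Write data at the given offset without touching the rest of the file."""
    fd = os.open(path, os.O_WRONLY)
    try:
        view = memoryview(data)
        # pwrite may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.pwrite(fd, view, offset)
            view = view[written:]
            offset += written
    finally:
        os.close(fd)


def build_output(path, payloads, processes=None):
    """Hypothetical driver: one task per slice, all writing into one file."""
    sizes = [len(p) for p in payloads]
    offsets = plan_offsets(sizes)
    preallocate(path, sum(sizes))
    with Pool(processes) as pool:
        pool.starmap(write_slice,
                     [(path, off, p) for off, p in zip(offsets, payloads)])
```

Because os.pwrite takes an explicit offset, the workers never share a file position, so the order in which they finish does not matter.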

EDIT

If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed the output of cat tmpfile1 ... tmpfileN to the consumer, either on stdin:

cat tmpfile1 ... tmpfileN | consumer

or via named pipes (using bash's Process Substitution):

consumer <(cat tmpfile1 ... tmpfileN)
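If the pipeline is driven from the Python script that produced the temporary files, the same stdin approach could be sketched as below. This is only an illustration: stream_to_consumer is a hypothetical helper, consumer_cmd stands for whatever command line the real consumer needs, and the 1 MiB copy buffer is an arbitrary choice.

```python
import shutil
import subprocess


def stream_to_consumer(paths, consumer_cmd):
    """Feed the files in order to the consumer's stdin, like `cat ... | consumer`."""
    proc = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    try:
        for path in paths:
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so no file is held in memory whole.
                shutil.copyfileobj(src, proc.stdin, 1024 * 1024)
    finally:
        # Closing stdin signals end-of-input to the consumer.
        proc.stdin.close()
    return proc.wait()
```

The data is still read once from disk, but no concatenated copy is ever written.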

125

answered Sep 20 '22 14:09

Marc Mutz - mmutz

You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.

In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.

FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), this requires surprisingly little code.
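To give a feel for why little code is needed, here is a sketch of the core logic a read handler in such a filesystem would delegate to: mapping an offset in the virtual file to the right chunk file. It is not a FUSE filesystem and uses no FUSE binding; the class name ConcatView and its methods are hypothetical, and it assumes the chunk files no longer change once the view is created.

```python
import bisect
import os


class ConcatView:
    """Read-only view of several files as if they were one concatenated file."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.starts = []  # virtual offset at which each chunk begins
        total = 0
        for path in self.paths:
            self.starts.append(total)
            total += os.path.getsize(path)
        self.size = total

    def read(self, offset, size):
        """Return up to `size` bytes starting at virtual `offset`."""
        if offset < 0 or size < 0:
            raise ValueError("offset and size must be non-negative")
        end = min(offset + size, self.size)
        out = bytearray()
        # Last chunk whose start is <= offset.
        i = bisect.bisect_right(self.starts, offset) - 1
        while offset < end and i < len(self.paths):
            with open(self.paths[i], "rb") as f:
                f.seek(offset - self.starts[i])
                data = f.read(end - offset)
            out += data
            offset += len(data)
            i += 1  # continue in the next chunk if the request spans chunks
        return bytes(out)
```

A real filesystem would also have to report the combined size in its attribute handler and would probably keep the chunk files open instead of reopening them on every read.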

answered Sep 16 '22 14:09

NPE

Tags: linux, copy, parallel-processing, cat