how to merge 2 big files [closed]

Suppose I have two files of 100 GB each, and I want to merge them into one and then delete the originals. On Linux we can use:

cat file1 file2 > final_file

But that needs to read two big files and then write an even bigger one. Is it possible to just append one file to the other so that no IO is required? Since a file's metadata contains its location and length, I am wondering whether the merge could be done by changing the metadata alone, so that no data is read or written.

asked Nov 16 '12 05:11

Daniel Wu

1 Answer

Can you merge two files without writing one file onto the other?

Only in obscure theory. Disk storage is block-based and filesystems store data on block boundaries, so you could only append one file to another without rewriting if the first file ended exactly on a block boundary. A few rare filesystem configurations use tail packing, but that would only help if the first file were already using the tail block of the previous file.

Unless that perfect scenario occurs, or your filesystem is able to mark a partial block in the middle of a file (I've never heard of this), this won't work. Just to kick the edge case around: there's also no way to make such a call without changing the kernel interface (re: Link to a specific inode).
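To make the block-boundary condition concrete, here is a minimal Python sketch that checks whether a file's size is an exact multiple of the filesystem block size. The function name `ends_on_block_boundary` is made up for illustration, and using `f_frsize` from `os.statvfs` (Unix only) as "the" block size is an assumption. Even when the check passes, there is still no standard call that would splice the two files together.

```python
import os


def ends_on_block_boundary(path, block_size=None):
    """Return True if the file's size is an exact multiple of block_size."""
    if block_size is None:
        # Assumption: f_frsize (fundamental block size) is the relevant
        # allocation unit for this filesystem. Unix only.
        block_size = os.statvfs(path).f_frsize
    return os.path.getsize(path) % block_size == 0
```

For a 100 GB file written by an arbitrary program, this will usually be False, which is why the metadata-only merge stays theoretical.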

Can we make this better than doubling the size of both files?

Yes, we can use the append (>>) operation instead.

cat file2 >> file1

That will still result in using all the space consumed by file2 twice over until we can delete it.
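If you would rather script this than type it, the following is a rough Python equivalent of `cat file2 >> file1` followed by deleting file2. The function name `append_and_remove` and the 16 MiB buffer size are arbitrary choices for this sketch. It has the same space cost as the shell version: file2's data exists twice until the final delete.

```python
import os
import shutil


def append_and_remove(dst_path, src_path, buffer_size=16 * 1024 * 1024):
    """Append src to the end of dst, then delete src."""
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        # Copy in fixed-size pieces so memory use stays bounded.
        shutil.copyfileobj(src, dst, buffer_size)
    # Only remove the source once the copy has finished without error.
    os.remove(src_path)
```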

Can we avoid using extra space?

No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the end of it, but there is no way to forget about the start without modifying inodes directly and altering the kernel interface to the filesystem, since that's definitely not a POSIX operation.
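As a small illustration of the one direction that does work, this Python sketch drops the tail of a file with `os.truncate`. The helper name `drop_tail` is invented here. There is no matching portable call for dropping the head.

```python
import os


def drop_tail(path, keep_bytes):
    """Shrink the file to its first keep_bytes bytes and return the new size."""
    # Everything after keep_bytes is discarded; the start is untouched.
    os.truncate(path, keep_bytes)
    return os.path.getsize(path)
```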

What about writing a little bit at a time, then deleting what we wrote?

No again. Since we can't chop the start of a file off, we'd have to rewrite everything from the point of interest all the way to the end of the file. This would be very costly for IO and only useful after we've already read half the file.

What about sparse files?

Maybe! Sparse files allow us to store a long run of zeroes without using nearly that much disk space. If we were to read file2 in large chunks starting at the end, we could write those chunks to the corresponding positions at the end of file1. file1 would immediately look (and read) as if it were the size of both files combined, but it would be corrupted until we were done, because everything we hadn't written yet would read as zeroes.

Explaining all this is another answer in itself, but if you can do a sparse allocation, you would need only your chunk read size plus a little extra disk space to perform this operation. For a reference on sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or search for the term SEEK_HOLE.

Why is this "maybe" instead of "yes"? Two parts: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by filesystems and other processes. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and rewriting the full length of file2, which isn't what you want. This method does mean you can append with just a small amount of extra disk space, though, rather than using at least 2*file2.
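A minimal sketch of such a tool in Python, under some loud assumptions: the filesystem supports sparse files (otherwise the first write allocates the whole gap and the space saving is lost), nothing else touches either file while it runs, and you can tolerate both files being unusable if it is interrupted partway. The name `append_back_to_front` and the default chunk size are made up for this example. It still reads and rewrites all of file2.

```python
import os


def append_back_to_front(dst_path, src_path, chunk_size=64 * 1024 * 1024):
    """Append src to dst from the end backwards, shrinking src as it goes."""
    dst_size = os.path.getsize(dst_path)
    remaining = os.path.getsize(src_path)
    with open(dst_path, "r+b") as dst, open(src_path, "r+b") as src:
        while remaining > 0:
            start = max(0, remaining - chunk_size)
            src.seek(start)
            chunk = src.read(remaining - start)
            # Seeking past the current end of dst and writing there leaves
            # a hole (reads as zeroes) on filesystems with sparse support.
            dst.seek(dst_size + start)
            dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
            # The chunk is now safely in dst, so give its space in src back.
            src.truncate(start)
            remaining = start
    os.remove(src_path)
```

Called as `append_back_to_front("file1", "file2")`, the extra disk space in use at any moment should be roughly one chunk, since each chunk is truncated away from file2 right after it lands in file1.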

answered Sep 29 '22 10:09

Jeff Ferland

Tags:

file

linux

merge