Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to merge 2 big files [closed]

Tags:

file

linux

merge

Suppose I have 2 files with size of 100G each. And I want to merge them into one, and then delete them. In linux we can use

cat file1 file2 > final_file

But that needs to read 2 big files, and then write a bigger file. Is it possible just append one file to the other, so that no IO is required? Since metadata of file contains the location of the file, and the length, I am wondering whether it is possible to change the metadata of the file to do the merge, so no IO will happen.

like image 224
Daniel Wu Avatar asked Nov 16 '12 05:11

Daniel Wu


People also ask

Can you merge more than 2 files?

To choose the merge option, click the arrow next to the Merge button and select the desired merge option. Once complete, the files are merged. If there are multiple files you want to merge at once, you can select multiple files by holding down the Ctrl and selecting each file you want to merge.


1 Answers

Can you merge two files without writing one file onto the other?

Only in obscure theory. Since disk storage is always based on blocks and filesystems therefore store things on block boundaries, you could only append one file to another without rewriting if the first file ended perfectly on a block boundary. There are some rare filesystem configurations that use tail packing, but that would only help if the first file where already using the tail block of the previous file.

Unless that perfect scenario occurs or your filesystem is able to mark a partial block in the middle of the file (I've never heard of this), this won't work. Just to kick the edge case around, there's also no way outside of changing the kernel interace to make such a call (re: Link to a specific inode)

Can we make this better than doubling the size of both files?

Yes, we can use the append (>>) operation instead.

cat file2 >> file1

That will still result in using all the space of consumed by file2 twice over until we can delete it.

Can we avoid using extra space?

No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the existence of the end of it, but there is no way to forget about the existence of the start unless we get back to modifying inodes directly and having to alter the kernel interface to the filesystem since that's definitely not a a POSIX operation.

What about writing a little bit at a time, then deleting what we wrote?

No again. Since we can't chop the start of a file off, we'd have to rewrite everything from the point of interest all the way to the end of the file. This would be very costly for IO and only useful after we've already read half the file.

What about sparse files?

Maybe! Sparse file allow us to store a long string of zeroes without using up nearly that much space. If we were to read file2 in large chunks starting at the end, we could write those blocks to the end of file1. file1 would immediately look (and read) as if it were the same size as both, but it would be corrupted until we were done because everything we hadn't written would be full of zeroes.

Explaining all this is another answer in itself, but if you can do a spare allocation, you would be able to use only your chunk read size + a little bit extra in disk space to perform this operation. For a reference talking about sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or do a search involving the term, SEEK_HOLE.

Why is this "maybe" instead of "yes"? Two parts: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by file systems and other processes alike. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and re-writing the length of file2, which isn't what you want. This method does mean you can append with just a small amount of disk space, though, rather at using at least 2*file2 amount of space.

like image 68
Jeff Ferland Avatar answered Sep 29 '22 10:09

Jeff Ferland