
Prepend to Very Large File in Fixed Time or Very Fast [closed]

Tags:

linux

centos7

I have a very large file (>500GB) to which I want to prepend a relatively small header (<20KB). Commands such as:

cat header bigfile > tmp
mv tmp bigfile

or similar approaches (e.g., with sed) are very slow.

What is the fastest method of writing a header to the beginning of an existing large file? I am looking for a solution that runs under CentOS 7.2. It is okay to install packages from the CentOS base or updates repos, EPEL, or RPMForge.

It would be great if some method exists that doesn't involve relocating or copying the large amount of data in the bigfile. That is, I'm hoping for a solution that can operate in fixed time for a given header file regardless of the size of the bigfile. If that is too much to ask for, then I'm just asking for the fastest method.

Compiling a helper tool (as in C/C++) or using a scripting language is perfectly acceptable.

Asked by Steve Amerige, Jun 17 '16


1 Answer

Is this something that needs to be done once, to "fix" a design oversight perhaps? Or is it something that you need to do on a regular basis, for example to add summary data (such as the number of data records) to the beginning of the file?

If you need to do it only once, then your best option is to accept that a mistake has been made and take the consequences of the retro-fix. As long as the destination drive is different from the source drive, you should be able to fix up a 500GB file in about two hours. So, after a week of batch processes running after hours, you could have upgraded perhaps thirty or forty files.

If this is a standard requirement for all such files, and you think you can apply the change only when the file is complete -- some sort of summary information perhaps -- then you should reserve the space at the beginning of each file and leave it empty. Then it is a simple matter of seeking into the header region and overwriting it with the real data once it can be supplied.
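As a concrete illustration, here is a minimal sketch in C of that seek-and-overwrite step. It assumes the big file was created with a fixed-size reserved region at offset 0; HEADER_SIZE and the file-name arguments are illustrative choices (sized to the question's <20KB header), not anything prescribed by the answer:

/* stamp_header.c -- minimal sketch: fill in a pre-reserved header region.
   Assumes bigfile was created with HEADER_SIZE bytes of padding at offset 0. */
#include <stdio.h>
#include <string.h>

#define HEADER_SIZE 20480            /* reserved region; matches the question's <20KB bound */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s bigfile headerfile\n", argv[0]);
        return 1;
    }

    FILE *big = fopen(argv[1], "r+b");   /* open for in-place update, no truncation */
    FILE *hdr = fopen(argv[2], "rb");
    if (!big || !hdr) {
        perror("fopen");
        return 1;
    }

    char buf[HEADER_SIZE];
    memset(buf, 0, sizeof buf);          /* zero-pad whatever the header doesn't fill */
    fread(buf, 1, sizeof buf, hdr);      /* read at most HEADER_SIZE header bytes */

    /* Seek to the reserved region and overwrite it; no other bytes move,
       so this takes the same time however large the file is. */
    if (fseek(big, 0, SEEK_SET) != 0 ||
        fwrite(buf, 1, sizeof buf, big) != sizeof buf) {
        perror("write header");
        return 1;
    }

    fclose(hdr);
    return fclose(big) == 0 ? 0 : 1;
}

Because only the reserved region is touched, the run time is independent of the file size. The one constraint is that the header region's length must be fixed when the file is first created, and readers of the file must know to skip it.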

As has been explained, standard file systems require the whole of a file to be copied in order to add something at the beginning.

If your 500GB file is on a standard hard disk, which will allow data to be read at around 100MB per second, then reading the whole file will take 500 × 1024 MB ÷ 100 MB/s = 5,120 seconds, or roughly 1 hour 25 minutes.

As long as you arrange for the destination to be a separate drive from the source, you can mostly write the new file in parallel with the read, so it shouldn't take much longer than that. But there's no way to speed it up other than that, I'm afraid.
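For completeness, here is a sketch in C of that unavoidable full-copy path. It is functionally the same as the cat command in the question, just with an explicit large buffer; the 8 MiB buffer size is an assumption, not a tuned value. With the output file on a separate drive, the kernel's readahead and writeback overlap the reads and writes for you:

/* prepend_copy.c -- minimal sketch of the full-copy fallback:
   stream header, then bigfile, into a new file on a different drive. */
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (8 * 1024 * 1024)   /* 8 MiB: assumed, not tuned */

static int stream(FILE *in, FILE *out, char *buf)
{
    size_t n;
    while ((n = fread(buf, 1, BUF_SIZE, in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;               /* short write: disk full or I/O error */
    return ferror(in) ? -1 : 0;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s header bigfile newfile\n", argv[0]);
        return 1;
    }

    char *buf = malloc(BUF_SIZE);
    FILE *hdr = fopen(argv[1], "rb");
    FILE *big = fopen(argv[2], "rb");
    FILE *out = fopen(argv[3], "wb");    /* put newfile on a separate drive */
    if (!buf || !hdr || !big || !out) {
        perror("setup");
        return 1;
    }

    /* Header first, then the big file -- the same effect as
       "cat header bigfile > newfile". */
    if (stream(hdr, out, buf) != 0 || stream(big, out, buf) != 0) {
        perror("copy");
        return 1;
    }

    fclose(hdr);
    fclose(big);
    free(buf);
    return fclose(out) == 0 ? 0 : 1;
}

After verifying the new file, it can be renamed over the original, as in the question's mv step.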

Answered by Borodin