
Concatenate two huge files in C++

Tags:

c++

stl

I have two text files, written with std::ofstream, of a hundred-plus megabytes each, and I want to concatenate them. Reading all of the data into memory through fstreams to build a single file usually ends in an out-of-memory error because the combined size is too big.

Is there any way of merging them faster than O(n)?

File 1 (160MB):

0 1 3 5
7 9 11 13
...
...
9187653 9187655 9187657 9187659 

File 2 (120MB):

a b c d e f g h i j
a b c d e f g h j i
a b c d e f g i h j
a b c d e f g i j h
...
...
j i h g f e d c b a

Merged (380MB):

0 1 3 5
7 9 11 13
...
...
9187653 9187655 9187657 9187659 
a b c d e f g h i j
a b c d e f g h j i
a b c d e f g i h j
a b c d e f g i j h
...
...
j i h g f e d c b a

File generation:

std::ofstream a_file ( "file1.txt" );
std::ofstream b_file ( "file2.txt" );

while (/* whatever */) {
    a_file << num << std::endl;
}

while (/* whatever */) {
    b_file << character << std::endl;
}

// merge them here, doesn't matter if output is one of them or a new file
a_file.close();
b_file.close();
asked Oct 24 '13 by MLProgrammer-CiM


4 Answers

Assuming you don't want to do any processing, and just want to concatenate two files to make a third, you can do this very simply by streaming the files' buffers:

std::ifstream if_a("a.txt", std::ios_base::binary);
std::ifstream if_b("b.txt", std::ios_base::binary);
std::ofstream of_c("c.txt", std::ios_base::binary);

of_c << if_a.rdbuf() << if_b.rdbuf();

I have tried this sort of thing with files of up to 100 MB in the past and had no problems. You effectively let C++ and the standard library handle whatever buffering is required, and you don't need to worry about tracking file positions even if the files get really big.

Alternatively, if you just want to append b.txt onto the end of a.txt, open a.txt with the append flag and seek to the end:

std::ofstream of_a("a.txt", std::ios_base::binary | std::ios_base::app);
std::ifstream if_b("b.txt", std::ios_base::binary);

of_a.seekp(0, std::ios_base::end);
of_a << if_b.rdbuf();

Both methods work by passing the std::streambuf of the input stream to the output stream's operator<<, which has an overload that takes a pointer to a streambuf. When there are no errors, the contents of the streambuf are inserted unformatted into the output stream until end of file is reached.
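
For completeness, here is a minimal self-contained sketch of the first approach with basic open-failure checks added; the file names a.txt, b.txt and c.txt are the ones used above.

#include <fstream>
#include <iostream>

int main() {
    std::ifstream if_a("a.txt", std::ios_base::binary);
    std::ifstream if_b("b.txt", std::ios_base::binary);
    std::ofstream of_c("c.txt", std::ios_base::binary);

    if (!if_a || !if_b || !of_c) {
        std::cerr << "Failed to open one of the files\n";
        return 1;
    }

    // operator<<(std::streambuf*) copies the stream in library-managed chunks,
    // so the whole file is never held in memory at once.
    of_c << if_a.rdbuf() << if_b.rdbuf();

    return of_c.good() ? 0 : 1;
}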

answered by icabod


Is there any way of merging them faster than O(n)?

Faster than O(n) would mean processing the data without even passing over it once. You cannot merge the files without reading each of them at least once, so the short answer is no.

For reading the data, you should consider unbuffered, block-wise reads (look at std::fstream::read), as sketched below.
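
A minimal sketch of such a read()-based pass; the 64 KB block size and the merged.txt output name are illustrative choices, while file1.txt and file2.txt are the names from the question.

#include <fstream>
#include <initializer_list>
#include <vector>

int main() {
    std::ifstream first("file1.txt", std::ios_base::binary);
    std::ifstream second("file2.txt", std::ios_base::binary);
    std::ofstream merged("merged.txt", std::ios_base::binary);

    std::vector<char> block(64 * 1024);            // only one block in memory at a time
    for (std::ifstream* in : { &first, &second }) {
        while (in->read(block.data(), block.size())) {
            merged.write(block.data(), block.size());
        }
        merged.write(block.data(), in->gcount()); // the final, partial block
    }
}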

answered by utnapistim


On Windows:

system ("copy File1+File2 OutputFile");

On Linux:

system ("cat File1 File2 > OutputFile");

But the answer is simple: don't read the whole file into memory! Read the input files in small blocks:

void Cat (input_file, output_file)
{
  while ((bytes_read = read_data (input_file, buffer, buffer_size)) != 0)
  { 
    write_data (output_file, buffer, bytes_read);
  }
}

int main ()
{
   output_file = open output file

   input_file = open input file1
   Cat (input_file, output_file)
   close input_file

   input_file = open input file2
   Cat (input_file, output_file)
   close input_file
}
answered by Skizz


It really depends on whether you wish to use "pure" C++ for this. Personally, at the cost of portability, I would be tempted to write:

#include <cstdlib>
#include <sstream>

int main(int argc, char* argv[]) {
    std::ostringstream command;

    command << "cat "; // Linux Only, command for Windows is slightly different

    for (int i = 2; i < argc; ++i) { command << argv[i] << " "; }

    command << "> ";

    command << argv[1];

    return system(command.str().c_str());
}
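
Invoked as, say, ./concat merged.txt file1.txt file2.txt (the binary name here is hypothetical), this builds and runs the command cat file1.txt file2.txt > merged.txt: the first argument is the output file and the remaining arguments are the inputs.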

Is it good C++ code? No, not really (it is non-portable and does not escape command arguments).

But it'll get you way ahead of where you are standing now.

As for a "real" C++ solution, with all the ugliness that streams could manage...

#include <fstream>
#include <string>
#include <vector>

static size_t const BufferSize = 8192; // 8 KB

void appendFile(std::string const& outFile, std::string const& inFile) {
    std::ofstream out(outFile, std::ios_base::app |
                               std::ios_base::binary |
                               std::ios_base::out);

    std::ifstream in(inFile, std::ios_base::binary |
                             std::ios_base::in);

    std::vector<char> buffer(BufferSize);
    while (in.read(&buffer[0], buffer.size())) {
        out.write(&buffer[0], buffer.size());
    }

    // The final read fails when it hits EOF, but it may still have
    // placed *some* bytes in the buffer, so write out exactly gcount() of them.
    out.write(&buffer[0], in.gcount());
}
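
A minimal driver for the function above (compiled together with it) might look like this; it uses the question's file names and, as the question allows, writes the merged result into one of the inputs, file1.txt.

int main() {
    // Appends file2.txt onto the end of file1.txt,
    // so file1.txt ends up holding the merged data.
    appendFile("file1.txt", "file2.txt");
}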
answered by Matthieu M.