Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving IO performance for merging two files in C

Tags:

performance

c

io

I wrote a function which merges two large files (file1,file2) into a new file (outputFile). Each file is a line based format while entries are separated by \0 byte. Both files have the same amount of null bytes.

One example file with two entries could look like this A\nB\n\0C\nZ\nB\n\0

   Input:
   file1: A\nB\0C\nZ\nB\n\0
   file2: BBA\nAB\0T\nASDF\nQ\n\0
   Output
   outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0

FILE * outputFile = fopen(...);
setvbuf ( outputFile  , NULL , _IOFBF , 1024*1024*1024 )
FILE * file1 = fopen(...); 
FILE * file2 = fopen(...); 
int c1, c2;
while((c1=fgetc(file1)) != EOF) {
    if(c1 == '\0'){
        while((c2=fgetc(file2)) != EOF && c2 != '\0') {
            fwrite(&c2, sizeof(char), 1, outputFile);
        }
        char nullByte = '\0';
        fwrite(&nullByte, sizeof(char), 1, outputFile);
    }else{
        fwrite(&c1, sizeof(char), 1, outputFile);
    }
}

Is there a way to improve this IO performance of this function? I increased the buffer size of outputFile to 1 GB by using setvbuf. Would it help to use posix_fadvise on file1 and file2?

like image 642
martin s Avatar asked Aug 09 '16 05:08

martin s


Video Answer


1 Answers

You're doing IO character-by-character. That is going to be needlessly and painfully S-L-O-W, even with buffered streams.

Take advantage of the fact that your data is stored in your files as NUL-terminated strings.

Assuming you're alternating nul-terminated strings from each file, and running on a POSIX platform so you can simply mmap() the input files:

typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;

    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );

    data.bytes = sb.st_size;

    /* assumes we have a NUL byte after the file data 
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    munmap( data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';

    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );

    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );

    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );

    unmapFile( data1 );
    unmapFile( data2 );       
}

I've put in no error checking at all. You'll also need to add the proper header files.

Note also that the file data is assumed to NOT be an exact multiple of the system page size, thus ensuring that there's a NUL byte mapped after the file contents. If the size of the file is an exact multiple of the page size, you'll have to mmap() an additional page after the file contents to ensure that there's a NUL byte to terminate the last string.

Or you can rely on there being a NUL byte as the last byte of the file's contents. If that ever turns out to not be true, you'll likely get either a SEGV or corrupted data.

like image 193
Andrew Henle Avatar answered Oct 09 '22 00:10

Andrew Henle