
Optimizing disk IO

Tags: c, io, optimization

I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.

There are two types of files in use. The first consists of a stream of 16-bit integers, which must be scaled after I/O to convert them to physically meaningful floating point values. I read the file in chunks, reading one 16-bit code at a time, performing the required scaling, and then storing the result in an array. Code is below:

int64_t read_current_chimera(FILE *input, double *current,
                             int64_t position, int64_t length, chimera *daqsetup)
{
    int64_t test;
    uint16_t iv;

    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(&iv, sizeof(uint16_t), 1, input);
        if (test == 1)
        {
            read++;
            current[i] = chimera_gain(iv, daqsetup);
        }
        else
        {
            perror("End of file reached");
            break;
        }
    }
    return read;
}

The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.

The second file type is a stream of 64-bit doubles arranged in two columns, of which I only need the first. To get these I fread pairs of doubles and discard the second of each pair. The doubles must also be endian-swapped before use. The code I use to do this is below:

int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
    int64_t test;
    double iv[2];

    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(iv, sizeof(double), 2, input);
        if (test == 2)
        {
            read++;
            swapByteOrder((int64_t *)&iv[0]);
            current[i] = iv[0];
        }
        else
        {
            perror("End of file reached: ");
            break;
        }
    }
    return read;
}

Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?

KBriggs asked Aug 19 '16 16:08




1 Answer

First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, a lot of your overhead comes from the sheer number of fread calls. Since the files are large, there is a big benefit to increasing the amount of data you read per I/O.

Convince yourself of this by putting together two small programs that read the stream:

1) read it as you do in the example above, 2 doubles at a time.

2) read it the same way, but make it 10,000 doubles at a time.

Time both runs a few times, and odds are you will observe that #2 runs much faster.

Best of luck.

EvilTeach answered Sep 26 '22 15:09