Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

very fast text file processing (C++)

i wrote an application which processes data on the GPU. Code works well, but i have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the processing line by line is slow).

I read a line with getline() and copy line 1 to a vector, line2 to a vector and skip lines 3 and 4. And so on for the rest of the 11 mio lines.

I tried several approaches to get the file at the best time possible:

Fastest method I found is using boost::iostreams::stream

Others were:

  • Read the file as gzip, to minimize IO, but is slower than directly reading it.
  • copy file to ram by read(filepointer, chararray, length) and process it with a loop to distinguish the lines (also slower than boost)

Any suggestions how to make it run faster?

void readfastq(char *filename, int SRlength, uint32_t blocksize){
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2. line
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source>ins(filename);
    in = ins;

    readNextBlock();
}


void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;

            //read only a part of the file i.e the first 4mio lines
    while (std::getline(in, name) && _filelength<_blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            for (int k = 0; k < _SRlength; k++) {

                //handle special letters
                                    if(seqtemp[k]== 'A') ...
                                    else{
                _seqs.push_back(5);
                                    }

            }
            _filelength++;
        }
    }

EDIT:

The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2

I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq the blocksize (in lines) must be bigger than the number of lines to read.

SOLUTION:

I found a solution, which get the time for read in the file from 60sec to 16sec. I removed the inner-loop which handeles the special characters and do this in GPU. This decreases the read-in time and only minimal increases the GPU running time.

Thanks for your suggestions.

void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_read, bytes_expected;

    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END); //go to the end of file
    bytes_expected = ftell(fp); //get filesize
    fseek(fp, 0L, SEEK_SET); //go to the begining of the file

    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
        err(EX_OSERR, "data malloc");


    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source>file(filename);


    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU

            _filelength++;
        }
    }
}
like image 498
mic Avatar asked Nov 14 '11 14:11

mic


3 Answers

First instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit to fit 3GB of virtual address space (for 32-bit application only 2GB is accessible in the user mode). Or alternatively you may map & process your file by parts.

Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hits the performance very seriously). If this is the case - either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.

Post your code (or a part of it) for more suggestions.

EDIT:

It seems that all your bottlenecks are related to string management.

  • std::getline(in, seqtemp); reading into an std::string deals with the dynamic memory allocation.
  • _names.push_back(name); This is even worse. First the std::string is placed into the vector by value. Means - the string is copied, hence another dynamic allocation/freeing happens. Moreover, when eventually the vector is internally reallocated - all the contained strings are copied again, with all the consequences.

I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).

Like in this code:

class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:

    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;

        const char* sz = (char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;

            m_sz = sz + 1;
            m_Len -= (len + 1);

            out.m_Len = len;

            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            out.m_Len = m_Len;

            if (!m_Len)
                return false;

            m_Len = 0;
        }

        return true;
    }

};
like image 86
valdo Avatar answered Nov 11 '22 11:11

valdo


if _seqs and _names are std::vectors and you can guess the final size of them before processing the whole 3GB of data, you can use reserve to avoid most of the memory re-allocation during pushing back the new elements in the loop.

You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.

like image 36
moooeeeep Avatar answered Nov 11 '22 13:11

moooeeeep


You are apparently using <stdio.h> since using getline.

Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells (it is a GNU extension) to use mmap for reading.

Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.

Probably, using the readahead system call (in a separate thread perhaps) could help.

But all this are guesses. You should really measure things.

like image 2
Basile Starynkevitch Avatar answered Nov 11 '22 11:11

Basile Starynkevitch