
How to parallelize reading lines from an input file when lines get independently processed?

I just started off with OpenMP in C++. My serial code looks something like this:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        while(getline(inputfile, line)) {
            // Line gets processed and written into an output file
        }
    }
}

Because each line is processed pretty much independently and the input file is on the order of gigabytes, I was attempting to use OpenMP to parallelize this. I'm guessing that I first need to get the number of lines in the input file and then parallelize the code this way. Can someone please help me out here?

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        //Calculate number of lines in file?
        //Set an output filename and open an ofstream
        #pragma omp parallel num_threads(8)
        {
            #pragma omp for schedule(dynamic, 1000)
            for(int i = 0; i < lines_in_file; i++) {
                 //What do I do here? I cannot just read any line because it requires random access
            }
        }
    }
}

EDIT:

Important Things

  1. Each line is independently processed
  2. Order of the results doesn't matter
asked Oct 05 '10 by Legend


1 Answer

Not a direct OpenMP answer, but what you are probably looking for is a Map/Reduce approach. Take a look at Hadoop; it's written in Java, but there is at least some C++ API.

In general, you want to process this amount of data on different machines, not in multiple threads in the same process (virtual address space limitations, lack of physical memory, swapping, etc.). Also, the kernel will have to read the file from disk sequentially anyway, which is what you want; otherwise the hard drive would just have to do extra seeks for each of your threads.
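
That said, if you do keep everything in one process, a common pattern is to keep the disk I/O sequential and parallelize only the per-line CPU work: read a batch of lines into memory, process the batch with an OpenMP parallel for, and write the results out. Below is a minimal sketch of that idea, not the asker's actual program: the process() function, the batch size, and the output filename taken from argv[2] are all hypothetical placeholders standing in for whatever the real serial loop does per line.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical placeholder for the per-line work done in the serial loop.
std::string process(const std::string& line) {
    return line;  // identity transform for illustration only
}

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <input> <output>\n";
        return 1;
    }

    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);          // assumed output file
    const std::size_t batch_size = 10000; // tune to the workload

    std::vector<std::string> batch;
    batch.reserve(batch_size);
    std::string line;

    bool more = true;
    while (more) {
        // Sequential I/O: fill one batch of lines from the input file.
        batch.clear();
        while (batch.size() < batch_size && std::getline(in, line))
            batch.push_back(line);
        more = (batch.size() == batch_size);

        std::vector<std::string> results(batch.size());

        // Parallel CPU work: each line is independent, so a plain
        // parallel for over the in-memory batch is safe.
        #pragma omp parallel for schedule(dynamic)
        for (long i = 0; i < static_cast<long>(batch.size()); ++i)
            results[i] = process(batch[i]);

        // Sequential output; order within a batch happens to be preserved,
        // though the question says order doesn't matter.
        for (const std::string& r : results)
            out << r << '\n';
    }
}

This avoids having to count the lines up front and keeps the hard drive reading sequentially; only the independent per-line processing is spread across threads.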

answered Nov 07 '22 by Nikolai Fetissov