
How to parallelize reading lines from an input file when lines get independently processed?

I just started off with OpenMP in C++. My serial code looks something like this:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        while(getline(inputfile, line)) {
            // Line gets processed and written into an output file
        }
    }
}

Because each line is processed pretty much independently and the input file is on the order of gigabytes, I was attempting to use OpenMP to parallelize this. I'm guessing that I first need to get the number of lines in the input file and then parallelize the code this way. Can someone please help me out here?

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        //Calculate number of lines in file?
        //Set an output filename and open an ofstream
        #pragma omp parallel num_threads(8)
        {
            #pragma omp for schedule(dynamic, 1000)
            for(int i = 0; i < lines_in_file; i++) {
                 //What do I do here? I cannot just read any line because it requires random access
            }
        }
    }
}

EDIT:

Important Things

  1. Each line is independently processed
  2. Order of the results doesn't matter
asked Oct 05 '10 by Legend


1 Answer

Not a direct OpenMP answer, but what you are probably looking for is a Map/Reduce approach. Take a look at Hadoop; it's written in Java, but there is at least some C++ API.

In general, you want to process this amount of data on different machines, not in multiple threads in the same process (virtual address space limitations, lack of physical memory, swapping, etc.). Also, the kernel will have to read the file from disk sequentially anyway, which is what you want; otherwise the hard drive would just have to do extra seeks for each of your threads.
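
That said, if you do keep everything in one process, a common pattern is to keep the disk I/O sequential and parallelize only the per-line CPU work: read a batch of lines into memory, process the batch with an OpenMP parallel for, and write the results out. Below is a minimal sketch of that idea, not the asker's actual program: the process() function, the batch size, and the output filename taken from argv[2] are all hypothetical placeholders standing in for whatever the real serial loop does per line.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical placeholder for the per-line work done in the serial loop.
std::string process(const std::string& line) {
    return line;  // identity transform for illustration only
}

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <input> <output>\n";
        return 1;
    }

    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);          // assumed output file
    const std::size_t batch_size = 10000; // tune to the workload

    std::vector<std::string> batch;
    batch.reserve(batch_size);
    std::string line;

    bool more = true;
    while (more) {
        // Sequential I/O: fill one batch of lines from the input file.
        batch.clear();
        while (batch.size() < batch_size && std::getline(in, line))
            batch.push_back(line);
        more = (batch.size() == batch_size);

        std::vector<std::string> results(batch.size());

        // Parallel CPU work: each line is independent, so a plain
        // parallel for over the in-memory batch is safe.
        #pragma omp parallel for schedule(dynamic)
        for (long i = 0; i < static_cast<long>(batch.size()); ++i)
            results[i] = process(batch[i]);

        // Sequential output; order within a batch happens to be preserved,
        // though the question says order doesn't matter.
        for (const std::string& r : results)
            out << r << '\n';
    }
}

This avoids having to count the lines up front and keeps the hard drive reading sequentially; only the independent per-line processing is spread across threads.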

answered Nov 07 '22 by Nikolai Fetissov