Parallelize while loop with OpenMP

I have a very large data file, and each record in it has 4 lines. I have written a very simple C program to analyze files of this type and print out some useful information. The basic idea of the program is this:

#include <stdio.h>

#define BUFFER_SIZE 1024

int main()
{
  char buffer[BUFFER_SIZE];
  while (fgets(buffer, BUFFER_SIZE, stdin))  /* line 1 of the record */
  {
    fgets(buffer, BUFFER_SIZE, stdin);       /* line 2 */
    do_some_simple_processing_on_the_second_line_of_the_record(buffer);
    fgets(buffer, BUFFER_SIZE, stdin);       /* line 3 */
    fgets(buffer, BUFFER_SIZE, stdin);       /* line 4 */
  }
  print_out_result();
  return 0;
}

This of course leaves out some details (sanity/error checking, etc), but that is not relevant to the question.

The program works fine, but the data files I'm working with are huge. I figured I would try to speed up the program by parallelizing the loop with OpenMP. After a bit of searching, though, it appears that OpenMP can only handle for loops where the number of iterations is known beforehand. Since I don't know the size of the files in advance, and even simple commands like wc -l take a long time to run, how can I parallelize this program?

Asked by Daniel Standage

1 Answer

As thiton mentioned, this code could well be I/O-bound. However, these days many machines have SSDs or high-throughput RAID arrays, in which case parallelization can give a speedup. Moreover, if the computation is non-trivial, parallelization wins: even when the I/O is effectively serialized by saturated bandwidth, you can still get a speedup by distributing the computation across multiple cores.


Back to the question itself: yes, you can parallelize this loop with OpenMP. With stdin I see no way to do it, because stdin has to be read sequentially and there is no prior information about where it ends. However, if you're working with a regular file, you can.

Here is my code using omp parallel. It uses some Win32 API and MSVC CRT functions:

// Win32/MSVC specific: _fseeki64/_ftelli64 come from the MSVC CRT,
// GetTickCount and the min macro from <windows.h>.
#include <windows.h>
#include <omp.h>
#include <cstdint>
#include <cstdio>
#include <iostream>

void test_io2()
{
  static const int BUFFER_SIZE = 1024;
  static const int CONCURRENCY = 4;

  uint64_t local_checksums[CONCURRENCY];
  uint64_t local_reads[CONCURRENCY];

  DWORD start = GetTickCount();

  omp_set_num_threads(CONCURRENCY);

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    // Each thread opens its own FILE*: a stdio stream can't safely be
    // shared for concurrent seeking and reading.
    FILE* file = fopen("huge_file.dat", "rb");
    _fseeki64(file, 0, SEEK_END);
    uint64_t total_size = _ftelli64(file);

    // Split the file into CONCURRENCY chunks; the last thread also takes
    // the remainder when the size isn't evenly divisible.
    uint64_t my_start_pos = total_size/CONCURRENCY * tid;
    uint64_t my_end_pos   = (tid == CONCURRENCY - 1)
                          ? total_size
                          : total_size/CONCURRENCY * (tid + 1);
    uint64_t my_read_size = my_end_pos - my_start_pos;
    _fseeki64(file, my_start_pos, SEEK_SET);

    char* buffer = new char[BUFFER_SIZE];

    uint64_t local_checksum = 0;
    uint64_t local_read = 0;
    size_t read_bytes;
    while (my_read_size != 0 &&
           (read_bytes = fread(buffer, 1,
                               (size_t)min(my_read_size, (uint64_t)BUFFER_SIZE),
                               file)) != 0)
    {
      local_read += read_bytes;
      my_read_size -= read_bytes;
      for (size_t i = 0; i < read_bytes; ++i)
        local_checksum += buffer[i];
    }

    local_checksums[tid] = local_checksum;
    local_reads[tid]     = local_read;

    delete[] buffer;
    fclose(file);
  }

  uint64_t checksum = 0;
  uint64_t total_read = 0;
  for (int i = 0; i < CONCURRENCY; ++i)
    checksum += local_checksums[i], total_read += local_reads[i];

  std::cout << checksum << std::endl
    << total_read << std::endl
    << double(GetTickCount() - start)/1000. << std::endl;
}

This code looks a bit dirty because I needed to distribute the amount of file to read precisely, but it is fairly straightforward. One thing to keep in mind is that you need a per-thread file pointer; you can't simply share one file pointer, because the internal data structures may not be thread-safe. Also, this code could be parallelized with parallel for, but I think this approach is more natural; a sketch of the parallel for variant follows.
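
For reference, here is a minimal sketch of that parallel for variant, under the same assumptions as the code above (a hypothetical huge_file.dat, a byte-wise checksum standing in for real work, and MSVC's _fseeki64/_ftelli64):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <omp.h>

uint64_t checksum_chunks(const char* path, int concurrency)
{
    uint64_t checksum = 0;

    // One loop iteration per chunk; OpenMP hands each to a thread and
    // combines the per-thread sums via the reduction clause.
    #pragma omp parallel for num_threads(concurrency) reduction(+:checksum)
    for (int tid = 0; tid < concurrency; ++tid)
    {
        FILE* file = fopen(path, "rb");   // per-thread file pointer
        _fseeki64(file, 0, SEEK_END);
        uint64_t total_size = _ftelli64(file);

        uint64_t start = total_size / concurrency * tid;
        uint64_t end   = (tid == concurrency - 1)
                       ? total_size   // last thread takes the remainder
                       : total_size / concurrency * (tid + 1);
        _fseeki64(file, start, SEEK_SET);

        char buffer[1024];
        uint64_t remaining = end - start;
        size_t n;
        while (remaining != 0 &&
               (n = fread(buffer, 1,
                          (size_t)std::min<uint64_t>(remaining, sizeof buffer),
                          file)) != 0)
        {
            remaining -= n;
            for (size_t i = 0; i < n; ++i)
                checksum += buffer[i];
        }
        fclose(file);
    }
    return checksum;
}

The reduction(+:checksum) clause makes OpenMP combine the per-thread sums for you, so the manual local_checksums array from the version above is no longer needed.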


Simple experimental results

I tested this code reading a 10GB file on both an HDD (WD Green 2TB) and an SSD (Intel 120GB).

With the HDD, no speedup was obtained; in fact, a slowdown was observed. This clearly shows that the code is I/O-bound: it does virtually no computation, just I/O.

However, with the SSD, I got a speedup of 1.2 with 4 cores. Yes, the speedup is small, but you can still get it with an SSD. And when there is a bit more computation (I just added a very short busy-waiting loop), the speedup becomes significant: I was able to get a speedup of 2.5.


In sum, I'd recommend that you try parallelizing this code.

Also, if the computation is not trivial, I would recommend pipelining. The above code simply divides the input into several big chunks, resulting in poor cache efficiency, whereas pipeline parallelization may yield better cache utilization. Try TBB for pipeline parallelization; it provides a simple pipeline construct. A sketch follows.
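
For example, here is a minimal sketch of the question's record-processing loop restructured with oneTBB's parallel_pipeline (older TBB spells the filter modes tbb::filter::serial_in_order, etc.); process_second_line is a hypothetical stand-in for the question's actual per-record computation:

// A sketch, not a drop-in solution: a serial input stage reads one 4-line
// record at a time, a parallel stage processes the second line, and a
// serial output stage accumulates the results.
#include <cstdio>
#include <string>
#include <tbb/parallel_pipeline.h>

// Hypothetical stand-in for the question's per-record analysis.
static long process_second_line(const std::string& line)
{
    return (long)line.size();
}

int main()
{
    char buffer[1024];
    long result = 0;

    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/16,
        // Stage 1 (serial, in order): read one record, keep only line 2.
        tbb::make_filter<void, std::string>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::string {
                if (!fgets(buffer, sizeof buffer, stdin)) { fc.stop(); return {}; }  // line 1
                if (!fgets(buffer, sizeof buffer, stdin)) { fc.stop(); return {}; }  // line 2
                std::string second(buffer);
                fgets(buffer, sizeof buffer, stdin);   // line 3
                fgets(buffer, sizeof buffer, stdin);   // line 4
                return second;
            }) &
        // Stage 2 (parallel): the actual computation runs concurrently.
        tbb::make_filter<std::string, long>(tbb::filter_mode::parallel,
            [](const std::string& line) { return process_second_line(line); }) &
        // Stage 3 (serial): combine per-record results without locking.
        tbb::make_filter<long, void>(tbb::filter_mode::serial_out_of_order,
            [&](long r) { result += r; }));

    printf("%ld\n", result);
    return 0;
}

Compile and link against TBB (e.g. -ltbb). The token limit bounds how many records are in flight at once, so memory use stays constant regardless of file size, and this version even works on stdin, since only the serial input stage touches the stream.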

Answered by minjang