C++

Question

How does one read and split/chunk a file by the number of lines?

I would like to partition a file into separate buffers, while ensuring that a line is not split up between two or more buffers. I plan on passing these buffers into their own pthreads so they can perform some type of simultaneous/asynchronous processing.

I've read the answer to "reading and writing in chunks on linux using c", but I don't think it exactly answers the question of how to make sure that a line is not split across two or more buffers.

Dietmar Kühl · Accepted Answer

How is the file encoded? If each byte represents a character, I would do the following:

Memory map the file using mmap().
Tell the jobs their approximate start and end by computing it based on an appropriate chunk size.
Have each job find its actual start and end by finding the next '\n'.
Process the respective chunks concurrently.
Note that the first chunk needs special treatment because its start isn't approximate but exact.
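The boundary-finding steps above can be sketched roughly as follows. This is only an illustration, not code from the answer: `align_to_line` and `job_range` are hypothetical names, and `data`/`size` are assumed to describe a region that was already mapped (for example with mmap()) and holds one byte per character.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical helper: move an approximate offset `pos` forward so that it
// sits just past a '\n'. Offset 0 is returned unchanged because the first
// chunk's start is exact; offsets at or past the end map to `size`.
std::size_t align_to_line(const char* data, std::size_t size, std::size_t pos)
{
    if (pos == 0) return 0;
    if (pos >= size) return size;
    // Start at pos - 1 so an offset that already follows a newline stays put.
    const void* nl = std::memchr(data + pos - 1, '\n', size - (pos - 1));
    if (nl == nullptr) return size;
    return static_cast<std::size_t>(static_cast<const char*>(nl) - data) + 1;
}

// Hypothetical helper: the [begin, end) byte range that job number `job`
// out of `jobs` should process. Assumes jobs > 0, job < jobs, and that
// size * jobs does not overflow std::size_t.
std::pair<std::size_t, std::size_t>
job_range(const char* data, std::size_t size, std::size_t job, std::size_t jobs)
{
    const std::size_t approx_begin = size * job / jobs;
    const std::size_t approx_end = size * (job + 1) / jobs;
    return std::make_pair(align_to_line(data, size, approx_begin),
                          align_to_line(data, size, approx_end));
}
```

Each thread would call `job_range` with its own index. Because neighbouring jobs align the same approximate offset in the same way, the ranges meet exactly and no line is shared. A range can come out empty when a single line is longer than the approximate chunk size.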

Omnifarious · Answer

I would choose a chunk size in bytes. Then I would seek to the appropriate location in the file and read some smallish number of bytes at a time until I got a newline.

The first chunk's last character is the newline. The second chunk's first character is the character after the newline.

Always seek to a pagesize() boundary and read in pagesize() bytes at a time to search for your newline. This will tend to ensure that you only pull the minimum necessary from disk to find your boundaries. You could try reading like 128 bytes at a time or something. But you then risk making more system calls.
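A small sketch of the page alignment, assuming that pagesize() above stands for the system page size, which is queried here with sysconf(_SC_PAGESIZE). The name `page_floor` is made up for this illustration.

```cpp
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: round `offset` down to the start of the page that
// contains it, so that a probe read covers exactly one page.
off_t page_floor(off_t offset)
{
    const long page = ::sysconf(_SC_PAGESIZE);
    if (page <= 0) return offset;  // page size unavailable: leave offset as-is
    return offset - (offset % page);
}
```

A read that starts at `page_floor(offset)` also returns bytes that lie before `offset`, so the newline search should skip those and only look at the bytes from `offset` onwards.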

I wrote an example program that does this for letter frequency counting. This, of course, is largely pointless to split into threads as it's almost certainly IO bound. And it also doesn't matter where the newlines are because it isn't line oriented. But, it's just an example. Also, it's heavily reliant on you having a reasonably complete C++11 implementation.

threaded_file_split.cpp on lisp.paste.org

The key function is this:

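#include <algorithm>      // std::find
#include <cerrno>         // errno
#include <cstddef>        // std::size_t
#include <system_error>   // std::system_error
#include <sys/types.h>    // off_t, ssize_t
#include <unistd.h>       // pread

// Find the offset just past the first newline at or after `start`.
// If there is no such newline, return the end-of-file offset.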
off_t next_linestart(int fd, off_t start)
{
   using ::std::size_t;
   using ::ssize_t;
   using ::pread;

   const size_t bufsize = 4096;
   char buf[bufsize];

   for (bool found = false; !found;) {
      const ssize_t result = pread(fd, buf, bufsize, start);
      if (result < 0) {
         throw ::std::system_error(errno, ::std::system_category(),
                                   "Read failure trying to find newline.");
      } else if (result == 0) {
         // End of file
         found = true;
      } else {
         const char * const nl_loc = ::std::find(buf, buf + result, '\n');
         if (nl_loc != (buf + result)) {
            start += ((nl_loc - buf) + 1);
            found = true;
         } else {
            start += result;
         }
      }
   }
   return start;
}

Also notice that I use pread. This is absolutely essential when you have multiple threads reading from different parts of the file.

The file descriptor is a shared resource between your threads. When one thread reads from the file using ordinary functions it alters a detail about this shared resource, the file pointer. The file pointer is the position in the file at which the next read will occur.

Simply using lseek before you read each time does not fix this, because it introduces a race condition between the lseek and the read: another thread can move the file pointer in between the two calls.

The pread function allows you to read a bunch of bytes from a specific location within the file. It also doesn't alter the file pointer at all. Apart from the fact that it doesn't alter the file pointer, it's otherwise like combining an lseek and a read in the same call.
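To make that concrete, here is a minimal sketch of a positional read. `read_at` is a hypothetical wrapper, not part of the example program; it only relies on the POSIX pread call.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <system_error>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: read up to `count` bytes starting at `offset`.
// pread leaves the descriptor's file pointer alone, so threads sharing
// `fd` do not disturb one another's reads.
std::string read_at(int fd, off_t offset, std::size_t count)
{
    std::string out(count, '\0');
    const ssize_t got = ::pread(fd, &out[0], count, offset);
    if (got < 0) {
        throw std::system_error(errno, std::system_category(), "pread failed");
    }
    out.resize(static_cast<std::size_t>(got));  // short read near end of file
    return out;
}
```

Calling lseek(fd, 0, SEEK_CUR) before and after `read_at` should report the same position, which is the property that makes it safe to share one descriptor between threads.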

C++ - How to chunk a file for simultaneous/async processing?

Tags:

asynchronous

multithreading

pthreads

Ken
