Reading and writing in chunks on Linux using C

Tags:

c

linux

I have an ASCII file where every line contains a record of variable length. For example:

Record-1:15 characters
Record-2:200 characters
Record-3:500 characters
...
...
Record-n: X characters

As the file size is about 10 GB, I would like to read the records in chunks. Once read, I need to transform them and write them into another file in binary format.

So, for reading, my first reaction was to create a char array, such as:

FILE *stream;
char buffer[104857600]; // 100 MB char array
fread(buffer, 1, sizeof(buffer), stream); // element size 1, count = buffer size
  1. Is it correct to assume that Linux will issue one system call and fetch the entire 100 MB?
  2. As the records are separated by newlines, I search character by character for a newline character in the buffer and reconstruct each record (a sketch of what I mean is below).

My question is: is this how I should read in chunks, or is there a better alternative to read data in chunks and reconstitute each record? Is there an alternative way to read X variable-sized lines from an ASCII file in one call?
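Roughly, the per-record scan I have in mind looks like this (transform_record is just a placeholder for my transformation step):

size_t n = fread(buffer, 1, sizeof(buffer), stream); /* bytes actually read */
char *rec_start = buffer;
for (size_t i = 0; i < n; i++) {
    if (buffer[i] == '\n') {
        transform_record(rec_start, &buffer[i] - rec_start); /* placeholder */
        rec_start = &buffer[i] + 1;
    }
}
/* anything left between rec_start and buffer + n is a partial record;
   it has to be carried over to the front of the buffer before the next fread */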

Next, during writing, I do the same: I have a write char buffer, which I pass to fwrite to write a whole set of records in one call.

fwrite(buffer, 1, sizeof(buffer), stream);

UPDATE: If I call setbuf(stream, buffer), where buffer is my 100 MB char buffer, would fgets return data from the buffer or cause disk IO?
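(Since setbuf assumes a buffer of exactly BUFSIZ bytes, I assume I would actually need setvbuf for a 100 MB buffer; something like this is what I have in mind:)

char *iobuf = malloc(104857600); /* 100 MB stdio buffer, allocated on the heap */
if (iobuf == NULL) abort();
/* _IOFBF = fully buffered; must be called before any other I/O on the stream */
if (setvbuf(stream, iobuf, _IOFBF, 104857600) != 0) abort();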

Asked May 10 '12 by Jimm


2 Answers

  1. Yes, fread will fetch the entire thing at once. (Assuming it's a regular file.) But it won't read 105 MB unless the file itself is 105 MB, and if you don't check the return value you have no way of knowing how much data was actually read, or if there was an error.

  2. Use fgets (see man fgets) instead of fread. This will search for the line breaks for you.

    char linebuf[1000];
    FILE *file = ...;
    while (fgets(linebuf, sizeof(linebuf), file)) {
        // decode one line
    }
    
  3. There is a problem with your code.

    char buffer[104857600]; // too big
    

    If you try to allocate a large buffer (105 MB is certainly large) on the stack, then it will fail and your program will crash. If you need a buffer that big, you will have to allocate it on the heap with malloc or similar. I'd certainly keep stack usage for a single function in the tens of KB at most, although you could probably get away with a few MB on most stock Linux systems.
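A minimal sketch of the heap version, assuming the same 100 MB size as the question (stream is the already-open input file):

size_t bufsz = 104857600;             /* 100 MB */
char *buffer = malloc(bufsz);         /* heap allocation, not stack */
if (buffer == NULL) abort();

size_t n = fread(buffer, 1, bufsz, stream); /* n = bytes actually read */
/* ... scan buffer[0..n) for newline-terminated records ... */
free(buffer);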

As an alternative, you could just mmap the entire file into memory. This will not improve or degrade performance in most cases, but it is easier to work with.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int r, fdes;
struct stat st;
void *ptr;
size_t sz;

fdes = open(filename, O_RDONLY);
if (fdes < 0) abort();
r = fstat(fdes, &st);
if (r) abort();
if (st.st_size > (size_t) -1) abort(); // too big to map
sz = st.st_size;
ptr = mmap(NULL, sz, PROT_READ, MAP_SHARED, fdes, 0);
if (ptr == MAP_FAILED) abort();
close(fdes); // file no longer needed

// now, ptr has the data, sz has the data length
// you can use ordinary string functions

The advantage of using mmap is that your program won't run out of memory. On a 64-bit system, you can put the entire file into your address space at the same time (even a 10 GB file), and the system will automatically read new chunks as your program accesses the memory. The old chunks will be automatically discarded, and re-read if your program needs them again.

It's a very nice way to plow through large files.
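For example, a sketch of splitting the mapped data into lines with memchr (from <string.h>), where process_line stands in for whatever per-record transform you need:

const char *p = ptr;
const char *end = (const char *) ptr + sz;
while (p < end) {
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
    process_line(p, len); /* hypothetical per-record handler */
    p += len + 1;         /* step past the newline; past end on the last line, ending the loop */
}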

Answered Nov 20 '22 by Dietrich Epp


If you can, you might find that mmapping the file will be easiest. mmap maps (a portion of) a file into memory so the whole file can be accessed essentially as an array of bytes. In your case you might not be able to map the whole file at once; it would look something like this:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>


/* ... */

struct stat stat_buf;
long pagesz = sysconf(_SC_PAGESIZE);
int fd = fileno(stream);
off_t line_start = 0;
char *file_chunk = NULL;
char *input_line;
off_t cur_off = 0;
off_t map_offset = 0;
/* map 16M plus pagesize to ensure any record <= 16M will always fit in the mapped area */
size_t map_size = 16*1024*1024+pagesz;
/* find the file size before using it to clamp the mapping */
fstat(fd, &stat_buf);
/* limit mapped region to size of file */
if (map_offset + map_size > stat_buf.st_size) {
  map_size = stat_buf.st_size - map_offset;
}
/* map the first chunk of the file */
file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
if (file_chunk == MAP_FAILED) { /* handle the error */ }
/* the first line starts at the beginning of the file */
input_line = file_chunk;
// until we reach the end of the file
while (cur_off < stat_buf.st_size) {
  /* check if we're about to read outside the current chunk */
  if (!(cur_off-map_offset < map_size)) {
    // destroy the previous mapping
    munmap(file_chunk, map_size);
    // round down to the page before line_start
    map_offset = (line_start/pagesz)*pagesz;
    // limit mapped region to size of file
    if (map_offset + map_size > stat_buf.st_size) {
      map_size = stat_buf.st_size - map_offset;
    }
    // map the next chunk
    file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
    if (file_chunk == MAP_FAILED) { /* handle the error */ }
    // adjust the line start for the new mapping
    input_line = &file_chunk[line_start-map_offset];
  }
  if (file_chunk[cur_off-map_offset] == '\n') {
    // found a newline, process the current line
    process_line(input_line, cur_off-line_start);
    // set up for the next one
    line_start = cur_off+1;
    input_line = &file_chunk[line_start-map_offset];
  }
  cur_off++;
}
/* note: if the file does not end with '\n', the final partial line
   still needs to be processed here */

Most of the complication is there to avoid making too huge a mapping. You might be able to map the whole file in one go using:

char *file_data = mmap(NULL, stat_buf.st_size, PROT_READ, MAP_SHARED, fd, 0);
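If you do map the whole file, it is still worth checking that the call succeeded; a minimal sketch:

if (file_data == MAP_FAILED) {
    perror("mmap");
    /* fall back to the chunked mapping above, or bail out */
}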
Answered Nov 20 '22 by Geoff Reedy