
How to efficiently write a large sequence of NULL bytes in a file?

I have a file descriptor fd, an offset and a length, and I need to write length null bytes starting at offset in the file described by fd (note: the range never reaches the end of the file).

Is there an efficient way to do that other than filling a buffer with nulls and writing it repeatedly in a loop? The sequence of nulls may be up to 16 MB long, and I currently use a 512-byte buffer (about 32k calls to write(2)).

bfontaine asked Nov 07 '13 15:11



2 Answers

You could try mmapping the file at the desired offset, mapping exactly the required size, and then simply calling memset.

EDIT: Based on the code posted by @jthill, here is a simple example that compares the two approaches.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void create(int fsize)
{
  FILE *fd = fopen("data", "wb");
  fseek(fd, fsize - 1, SEEK_SET);
  fputc(0, fd);
  fclose(fd);
}

void seek_write(const char* data, int wsize, int seek, int dsize)
{
  int fd = open("data", O_RDWR);
  // Now seek_write
  if (lseek(fd, seek, SEEK_SET) != seek)
    perror("seek?"), abort();
  // Now write in requested blocks..
  for (int c = dsize / wsize; c--;)
    if (write(fd, data, wsize) != wsize)
      perror("write?"), abort();
  close(fd);
}

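void mmap_memset(int wsize, int seek, int dsize)
{
  (void)wsize;  // unused; kept so both benchmarks take comparable arguments
  int fd = open("data", O_RDWR);
  // Map from the start of the file so the mapping offset (0) is page-aligned
  void* map = mmap(0, dsize + seek, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (map == MAP_FAILED)
    perror("mmap?"), abort();
  memset((char*)map + seek, 0, dsize);
  // Unmap the full length that was mapped, not just dsize
  munmap(map, dsize + seek);
  close(fd);
}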

int main(int c, char **v)
{
  struct timeval start, end;
  long long ts1, ts2;
  int wsize = c>1 ? atoi(*++v) : 512;
  int seek  = c>2 ? atoi(*++v) : 0;
  int reps  = c>3 ? atoi(*++v) : 1000;
  int dsize = c>4 ? atoi(*++v) : 16*1024*1024;
  int fsize = c>5 ? atoi(*++v) : 32*1024*1024;

  // Create the file and grow...
  create(fsize);

  // Zero-filled source buffer; anonymous mappings should pass fd = -1 for portability
  char *data = mmap(0, wsize, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);

  printf("Starting write...\n");
  gettimeofday(&start, NULL);
  for (int i = 0;i < reps; ++i)
    seek_write(data, wsize, seek, dsize);
  gettimeofday(&end, NULL);

  ts1 = ((end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec);

  printf("Starting mmap...\n");
  gettimeofday(&start, NULL);
  for (int i = 0;i < reps; ++i)
    mmap_memset(wsize, seek, dsize);
  gettimeofday(&end, NULL);

  ts2 = ((end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec);

  printf("write: %lld us, %f us\nmmap: %lld us, %f us\n", ts1, (double)ts1/reps, ts2, (double)ts2/reps);
}

NOTES: mmap requires the file offset to be a multiple of the page size, so it is simpler to map offset + length bytes from the start of the file and memset from the offset, as done here. Alternatively, if you can guarantee a page-aligned offset, you can map at that offset directly.

The two approaches differ only in how they position (lseek versus map + seek) and how they zero the data (write versus memset). I think this is a fair comparison (if anyone wants to fix anything, feel free to).

I also use MAP_SHARED rather than MAP_PRIVATE. The difference is significant: MAP_PRIVATE is copy-on-write, so the changes would never reach the file, and it can be much slower as well.
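For the aligned-offset variant, here is a minimal sketch (not part of the benchmark above) that rounds the offset down to a page boundary and maps only the range it needs. The name zero_range_mmap is hypothetical, and the sketch assumes that fd is open for reading and writing and that the range lies entirely inside the file, as in the question.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Zero `len` bytes starting at `offset` in the file behind `fd`.
 * Assumption: [offset, offset + len) lies entirely inside the file.
 * Returns 0 on success, -1 on error. */
int zero_range_mmap(int fd, off_t offset, size_t len)
{
    if (len == 0)
        return 0;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* Round the offset down to a page boundary. `slack` is the number of
     * leading bytes that get mapped but are left untouched. */
    off_t aligned = offset - (offset % page);
    size_t slack = (size_t)(offset - aligned);

    char *map = mmap(NULL, len + slack, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, aligned);
    if (map == MAP_FAILED)
        return -1;

    memset(map + slack, 0, len);

    /* Unmap with the same length that was mapped. */
    return munmap(map, len + slack);
}
```

Compared with mapping from offset 0, this keeps the mapping small when the offset is far into a large file.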

On my not so powerful system, I get:

> ./fwrite 4096 1234
> Starting write...
> Starting mmap...
> write: 14767898 us, 14767.898000 us
> mmap: 6619623 us, 6619.623000 us

That suggests mmap + memset is quicker, at least on this system. Note that the test never calls msync or fsync, so it does not measure the cost of flushing the data to disk.

Nim answered Oct 17 '22 17:10


If you are running Linux and the filesystem supports sparse files, you could try to punch a hole in your file using fallocate(2) with the FALLOC_FL_PUNCH_HOLE flag (which must be combined with FALLOC_FL_KEEP_SIZE). I would expect that to be fast, although I didn't test it.
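A minimal sketch of that idea, assuming Linux with glibc, might look like the following. The function names zero_range and zero_range_pwrite are hypothetical, and the 64 KiB fallback buffer is an arbitrary choice. If the filesystem cannot punch holes, the sketch falls back to writing zeros with pwrite(2).

```c
#define _GNU_SOURCE           /* needed for fallocate(2) on glibc */
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>     /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>
#include <unistd.h>

/* Fallback: zero the range with pwrite(2), 64 KiB at a time
 * (buffer size is an arbitrary choice for this sketch). */
static int zero_range_pwrite(int fd, off_t offset, off_t len)
{
    static const char zeros[64 * 1024];  /* static storage is zero-initialised */

    while (len > 0) {
        size_t chunk = len < (off_t)sizeof zeros ? (size_t)len : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, offset);
        if (n < 0 && errno == EINTR)
            continue;                    /* interrupted: retry the same chunk */
        if (n <= 0)
            return -1;
        offset += n;                     /* handles short writes too */
        len -= n;
    }
    return 0;
}

/* Make [offset, offset + len) read back as zeros without changing the
 * file size. Tries to punch a hole first; if that fails for any reason
 * (for example EOPNOTSUPP), falls back to plain writes. */
int zero_range(int fd, off_t offset, off_t len)
{
    if (len <= 0)
        return 0;
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) == 0)
        return 0;
    return zero_range_pwrite(fd, offset, len);
}
```

Keep in mind that punching a hole deallocates the underlying blocks, so a later write into that range has to allocate space again.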

user2719058 answered Oct 17 '22 16:10