
Can perl sysopen open file for atomic writes?

While reading APUE (3rd edition), I came across the open system call and its ability to open a file for atomic append operations with the O_APPEND flag, meaning that multiple processes can write to the same file and the kernel ensures that the data written by the different processes does not overlap, so all lines stay intact.

After experimenting with the open system call in a C/C++ program, I was able to validate this, and it works just as the book describes. I launched multiple processes that wrote to a single file, and every line could be accounted for by its process PID.

I was hoping to see the same behaviour with Perl's sysopen, as I have some tasks at work that could benefit from it. I tried it out, but it did not work: when I analysed the output file, I saw signs of a race condition (probably), as lines were frequently interleaved.

Question: Is Perl's sysopen not the same as Linux's open system call? Is it possible to achieve this kind of atomic write by multiple processes to a single file?

EDIT: adding the C code and Perl code used for testing.

C/C++ code

#include <fcntl.h>      /* open, O_* flags */
#include <unistd.h>     /* write, close, getpid */
#include <cstdio>       /* printf */
#include <cstring>      /* strlen */
#include <string>

static const int MAXLINES = 1000000;

int main(void)
{
  int fd, n;
  /* O_CREAT needs a mode argument, e.g. 0644 */
  if ((fd = open("outfile.txt", O_WRONLY|O_CREAT|O_APPEND, 0644)) == -1) {
    printf("failed to create outfile! exiting!\n");
    return -1;
  }

  for (int counter{1}; counter <= MAXLINES; counter++)
  { /* write string 'line' for MAXLINES no. of times */
    std::string line = std::to_string(getpid())
      + " This is a sample data line ";
    line += std::to_string(counter) + " \n";
    if ((n = write(fd, line.c_str(), strlen(line.c_str()))) == -1) {
      printf("Failed to write to outfile!\n");
    }
  }
  close(fd);
  return 0;
}

Perl code

#!/usr/bin/perl

use Fcntl;
use strict;
use warnings;

my $maxlines = 100000;

sysopen (FH, "testfile", O_CREAT|O_WRONLY|O_APPEND) or die "failed sysopen\n";
while ($maxlines != 0) {
  print FH "($$) This is sample data line no. $maxlines\n";
  $maxlines--;
}
close (FH);
__END__

Update (after initial troubleshooting):

Thanks to the information provided in the answer below, I was able to get it working. I did run into an issue of some missing lines, which turned out to be caused by each process opening the file with O_TRUNC, which I shouldn't have done but missed initially. After some careful analysis I spotted the issue and corrected it. As always - Linux never fails you :).
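For reference, this is roughly the flag change involved (a sketch of my own, since I didn't keep the broken version; the O_TRUNC variant is what was losing lines):

# wrong: every process truncates the shared file as it starts up,
# wiping out whatever the other processes had already appended
sysopen (FH, "testfile", O_CREAT|O_WRONLY|O_TRUNC|O_APPEND) or die "failed sysopen\n";

# right: append only; O_APPEND makes the kernel position every write
# at the current end of the file
sysopen (FH, "testfile", O_CREAT|O_WRONLY|O_APPEND) or die "failed sysopen\n";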

Here is a bash script I used to launch the processes:

#!/bin/bash

# basically we spawn "$1" instances of the same 
# executable which should append to the same output file.

max=$1
[[ -z $max ]] && max=6
echo "creating $max processes for appending into same file"

# this is our output file collecting all
# the lines from all the processes.
# we truncate it before we start
>testfile

for i in $(seq 1 $max)
do
    echo $i && ./perl_read_write_with_syscalls.pl 2>>_err & 
done

# end.

Verification from output file:

[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$  ls -lrth testfile 
-rw-rw-r--. 1 compuser compuser 252M Jan 31 22:52 testfile
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$  wc -l testfile 
6000000 testfile
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$  cat testfile |cut -f1 -d" "|sort|uniq -c
1000000 (PID: 21118)
1000000 (PID: 21123)
1000000 (PID: 21124)
1000000 (PID: 21125)
1000000 (PID: 21126)
1000000 (PID: 21127)
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$  

Observations:

To my surprise, there was no noticeable I/O wait in the load average on the system - at all. I was not expecting that. I believe the kernel must somehow have taken care of it, but I don't know how it works. I would be interested to know more about it.

What could be the possible applications of this?

I do a lot of file-to-file reconciliations, and at work we always need to parse huge data files (30 GB - 50 GB each). With this working, I can now run things in parallel instead of my previous approach, which consisted of hashing file1, then hashing file2, and then comparing the key/value pairs from the two files. Now I can do the hashing part in parallel and bring the time down even further.
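As an illustration only (not the code I actually run at work), here is a minimal sketch of that idea in Perl: one forked worker per input file, each appending "key<TAB>value" records to a shared output file via sysopen + O_APPEND + syswrite. The file names, the '|' field delimiter and the record format are assumptions made up for the example.

#!/usr/bin/perl
# Hypothetical sketch: hash several input files in parallel, each worker
# appending key/value records to one shared output file.
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_APPEND);

my @inputs = @ARGV;             # e.g. file1.dat file2.dat ...
my $out    = "recon_keys.txt";  # shared output, appended to by all workers

for my $input (@inputs) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;               # parent: go launch the next worker

    # child: one writer per input file
    sysopen(my $ofh, $out, O_WRONLY|O_CREAT|O_APPEND) or die "sysopen: $!";
    open(my $ifh, '<', $input) or die "open $input: $!";
    while (my $line = <$ifh>) {
        chomp $line;
        my ($key, @rest) = split /\|/, $line;    # '|'-delimited input assumed
        my $rec   = join("\t", $key, @rest) . "\n";
        my $wrote = syswrite($ofh, $rec);        # one write(2) per record
        die "write failed: $!" unless defined $wrote && $wrote == length $rec;
    }
    exit 0;
}
wait() for @inputs;             # reap all workers before exiting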

Thanks

asked Jan 31 '19 by User9102d82

1 Answer

It doesn't matter if you open or sysopen; the key is using syswrite and sysread instead of print/printf/say/etc and readline/read/eof/etc.

syswrite maps to a single write(2) call, while print/printf/say/etc can result in multiple calls to write(2) (even if autoflush is enabled).[1]

sysread maps to a single read(2) call, while readline/read/eof/etc can result in multiple calls to read(2).

So, by using syswrite and sysread, you are subject to all the assurances that POSIX gives about those calls (whatever they might be) if you're on a POSIX system.
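As an illustration (my own sketch, not code from the answer), here is the question's writer loop reworked so that each record goes out in a single write(2) via syswrite; a plain '>>' open works just as well as sysopen with O_WRONLY|O_CREAT|O_APPEND here:

#!/usr/bin/perl
use strict;
use warnings;

# append mode; sysopen with O_WRONLY|O_CREAT|O_APPEND would work the same
open(my $fh, '>>', 'testfile') or die "open: $!";

for my $i (1 .. 100_000) {
    my $rec   = "($$) This is sample data line no. $i\n";
    my $wrote = syswrite($fh, $rec);             # one write(2) per record
    die "syswrite failed: $!" unless defined $wrote && $wrote == length $rec;
}
close($fh);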


  1. If you use print/printf/say/etc, and limit your writes to less than the size of the buffer between (explicit or automatic) flushes, you'll get a single write(2) call. The buffer size was 4 KiB in older versions of Perl, and it's 8 KiB by default in newer versions of Perl. (The size is decided when perl is built.)
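To illustrate the footnote (a sketch under the stated assumption that each record stays well below the 8 KiB buffer), buffered print with autoflush enabled then flushes every record as a single write(2):

#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;                 # for the autoflush method on older perls

open(my $fh, '>>', 'testfile') or die "open: $!";
$fh->autoflush(1);              # flush after every print

# assumption: the record is far smaller than perl's 8 KiB I/O buffer, so
# the flush becomes one write(2) and the appended line stays intact
my $rec = "($$) short record\n";
print {$fh} $rec or die "print failed: $!";
close($fh);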
answered Nov 15 '22 by ikegami