Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to subclass IO::Handle to properly get a low level file handle without having a file or memory?

I have an app which accesses a PostgreSQL database and needs to read some large binary data out of it depending on some needed processing. This might be hundreds of MB or even some GB of data. Please no discussion about using file systems instead or such, it's the way it is now.

That data is simply files of various types, e.g. it might be a Zip container or some other kind of archive. Some of the needed processing is list the contents of the Zip, maybe even extract some members for further processing, maybe hash the stored data... In the end the data is read multiple times, but written only once to store it.

All of the Perl libs I use are able to work with file handles, some with IO::Handle, others with IO::String or IO::Scalar, some others only with low level file handles. So what I've done is create a subclass of IO::Handle and IO::Seekable which acts like a wrapper for the corresponding methods around DBD::Pg. In the CTOR I create a connection to the database, open some provided LOID for reading and store the handle provided by Postgres in the instance. My own handle object is then forwarded to whoever is able to work with such a file handle and can directly read and seek within the blob provided by Postgres.

The problem is libs which use low level file handles or low level file handle operations on IO::Handle. Digest::MD5 seems to be one, Archive::Zip another one. Digest::MD5 croaks and tells me that no handle has been provided, Archive::Zip on the other hand tries to create a new, own handle from mine, calls IO::Handle::fdopen and fails in my case.

sub fdopen {
    @_ == 3 or croak 'usage: $io->fdopen(FD, MODE)';
    my ($io, $fd, $mode) = @_;
    local(*GLOB);

    if (ref($fd) && "".$fd =~ /GLOB\(/o) {
    # It's a glob reference; Alias it as we cannot get name of anon GLOBs
    my $n = qualify(*GLOB);
    *GLOB = *{*$fd};
    $fd =  $n;
    } elsif ($fd =~ m#^\d+$#) {
    # It's an FD number; prefix with "=".
    $fd = "=$fd";
    }

    open($io, _open_mode_string($mode) . '&' . $fd)
    ? $io : undef;
}

I guess the problem is the low level copy of the handle, which removes my own instance, so there's no instance anymore having my database connection and all that stuff.

So, is it even possible in my case to provide some IO::Handle which successfully can be used wherever a low level file handle is expected?

I mean, I don't have a real file handle, I have an object only where method calls are wrapped to their corresponding Postgres methods, for which a database handle is needed and such. All of that data needs to be stored somewhere, the wrapping needs to be done etc.

I tried to do what others are doing, like IO::String, which additionally uses tie for example. But in the end that use case is different, because Perl is able to create a real low level file handle to some internal memory on its own. Something which is not supported at all in my case. I need to keep my instance around, because only that knows of the handle to the database etc.

Using my handle like an IO::Handle by calling method read and such works like expected, but I would like to take it a bit further and be more compatible to whoever doesn't expect to work on IO::Handle objects. Much like IO::String or File::Temp can be used as low level file handles.

package ReadingHandle;

use strict;
use warnings;
use 5.10.1;

use base 'IO::Handle', 'IO::Seekable';

use Carp ();

sub new
{
  my $invocant  = shift || Carp::croak('No invocant given.');
  my $db        = shift || Carp::croak('No database connection given.');
  my $loid      = shift // Carp::croak('No LOID given.');
  my $dbHandle  = $db->_getHandle();
  my $self      = $invocant->SUPER::new();

    *$self->{'dbHandle'}  = $dbHandle;
    *$self->{'loid'}      = $loid;
  my $loidFd              = $dbHandle->pg_lo_open($loid, $dbHandle->{pg_INV_READ});
    *$self->{'loidFd'}    = $loidFd;

  if (!defined($loidFd))
  {
    Carp::croak("The provided LOID couldn't be opened.");
  }

  return $self;
}

sub DESTROY
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  $self->close();
}

sub _getDbHandle
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'dbHandle'};
}

sub _getLoid
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'loid'};
}

sub _getLoidFd
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'loidFd'};
}

sub binmode
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return 1;
}

sub close
{
  my $self      = shift || Carp::croak('The method needs to be called with an instance.');
  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();

  return $dbHandle->pg_lo_close($loidFd);
}

sub opened
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $loidFd  = $self->_getLoidFd();

  return defined($loidFd) ? 1 : 0;
}

sub read
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $buffer  =\shift // Carp::croak('No buffer given.');
  my $length  = shift // Carp::croak('No amount of bytes to read given.');
  my $offset  = shift || 0;

  if ($offset > 0)
  {
    Carp::croak('Using an offset is not supported.');
  }

  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();

  return $dbHandle->pg_lo_read($loidFd, $buffer, $length);
}

sub seek
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $offset  = shift // Carp::croak('No offset given.');
  my $whence  = shift // Carp::croak('No whence given.');

  if ($offset < 0)
  {
    Carp::croak('Using a negative offset is not supported.');
  }
  if ($whence != 0)
  {
    Carp::croak('Using a whence other than 0 is not supported.');
  }

  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();
  my $retVal    = $dbHandle->pg_lo_lseek($loidFd, $offset, $whence);
     $retVal    = defined($retVal) ? 1 : 0;

  return $retVal;
}

sub tell
{
  my $self      = shift || Carp::croak('The method needs to be called with an instance.');
  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();
  my $retVal    = $dbHandle->pg_lo_lseek($loidFd);
     $retVal    = defined($retVal) ? $retVal : -1;

  return $retVal;
}

1;
like image 632
Thorsten Schöning Avatar asked Nov 09 '22 06:11

Thorsten Schöning


1 Answers

There is a way around this, but it is a bit weird. Your requirements are basically threefold, if I'm reading your code and comments correctly:

  1. Work like a normal file handle/IO::Handle object as much as possible, make the fact that it's not a real file invisible to the user.
  2. Work with Archive::Zip, which is implemented mostly in regular Perl, and which calls the IO::Handle::fdopen code you posted, which fails to duplicate the handle since it's not a real handle.
  3. Work with Digest::MD5, which is implemented in XS using PerlIO. Since tie-based tricks and perl in-memory "fake" filehandles are not usable at that level, it's tricker than 2.

You can achieve all three of those by using PerlIO layers with PerlIO::via. The code is similar to what you'd write with tie (implement some required behavior methods). Additionally, you can harness the "open variable as file" functionality of open and the pre-rolled IO::Seekable + IO::Handle functionality of IO::File to simplify achieving requirement 1 above (make it usable in Perl code in the same way normal IO::Handle objects are).

Below is a sample package that does what you need. It has a few caveats:

  • It doesn't extend your code or interact with a DB at all; it just uses a supplied lines arrayref as file data. If this seems like it fits your use case, you should adapt it to work with a DB.
  • It implements the bare minimum necessary to work for the below demo usages. You'll need to implement a lot more methods to make it "well behaved" in most non-demo cases (e.g. it knows nothing of SEEK, EOF, BINMODE, SEEK, et. al). Be aware that the arguments/expected behavior of the functions you'll be implementing is not the same as what you'd do for tie or Tie::Handle; the "interface" have the same names, but different contracts.
  • All methods that receive the invocant should not use it as a hashref/globref directly; they should track all custom state in the *$self->{args} glob field. This is because the blessed object is created twice (once blessed by PerlIO and once by SUPER::new), so state needs to be shared through a shared reference. If you replace the args field or add/remove any other fields, they will only be visible to the set of methods that created them: either the PerlIO methods or the "normal" object methods. See the comment in the constructor for more info.
  • PerlIO in general is not super easy to introspect. If something fails underneath a low-level operation like sysread or <$fh>, a lot of code will bug out or do unexpected things, since it considers those functions to be unable to die/atomic-ish at the operation level. Similarly, when messing with PerlIO it's easy for failure modes to escape the realm of "die or return an error value" and end up in the realm of "segfault or core dump", especially if multiple processes (fork()) or threads are involved (these weird case are, for example, why the below module isn't implemented around IO::File->new; followed by $file->open(... "via:<($class)"); it core dumps for me, no idea why). TL;DR debugging why stuff goes wrong at the PerlIO level can be annoying, you were warned :)
  • Any XS code that addresses a raw filehandle or doesn't work through the PerlIO perlapi functions will not honor this. There are unfortunately plenty of those, but usually not in common, well-supported CPAN modules. Basically, Digest::MD5 doesn't work with tied handles because it's operating at a level "below" tie's magic; PerlIO is one level "lower" than that, but there is yet another level below.
  • This code is a bit messy, and could certainly be cleaned up. In particular, it would probably be quite a bit nicer to open() the layered object directly, skip all the weird pseudo-indirect-object stuff, and then wrap it in an IO::Handle some other way, e.g. via IO::Wrap.
  • PerlIO doesn't work, or works differently, on many much older Perls.

Package:

package TiedThing;

use strict;
use warnings;
use parent "IO::File";

our @pushargs;
sub new {
    my ( $class, $args ) = @_;
    # Build a glob to be used by the PerlIO methods. This does two things:
    # 1. Gets us a place to stick a shared hashref so PerlIO methods and user-
    # -defined object methods can manipulate the same data. They must use the
    # {args} glob field to do that; new fields written will .
    # 2. Unifies the ways of addressing that across custom functions and PerlIO
    # functions. We could just pass a hashref { args => $args } into PUSHED, but
    # then we'd have to remember "PerlIO functions receive a blessed hashref,
    # custom functions receive a blessed glob" which is lame.
    my $glob = Symbol::gensym();
    *$glob->{args} = $args;
    local @pushargs = ($glob, $class);
    my $self = $class->SUPER::new(\my $unused, "<:via($class)");
    *$self->{args} = $args;
    return $self;
}

sub custom {
    my $self = shift;
    return *$self->{args}->{customvalue};
}

sub PUSHED { return bless($pushargs[0], $pushargs[1]); }

sub FILL { return shift(@{*$_[0]->{args}->{lines}}); }

1;

Example usage:

my $object = TiedThing->new({
    lines => [join("\n", 1..9, 1..9)],
    customvalue => "custom!",
});
say "can call custom method: " . $object->custom;
say "raw read with <>: " . <$object>;
my $buf;
read($object, $buf, 10);
say "raw read with read(): " . $buf;
undef $buf;
$object->read($buf, 10);
say "OO read via IO::File::read (end): " . $buf;
my $checksummer = Digest::MD5->new;;
$checksummer->addfile($object);
say "Md5 read: " . $checksummer->hexdigest;
my $dupto = IO::Handle->new;
# Doesn't break/return undef; still not usable without implementing
# more state sharing inside the object.
say "Can dup handle: " . $dupto->fdopen($object, "r");

my $archiver = Archive::Zip->new;
# Dies, but long after the fdopen() call. Can be fixed by implementing more
# PerlIO methods.
$archiver->readFromFileHandle($object);
like image 51
Zac B Avatar answered Nov 15 '22 06:11

Zac B