
Spreadsheet::ParseExcel::Stream losing its parser

Tags: excel, perl

I have an 18M Excel spreadsheet to parse, and Spreadsheet::ParseExcel was consuming so much memory that I had to switch to Spreadsheet::ParseExcel::Stream. It works fine on my VM and on our staging server, but on our production server (configured the same way), I get this error:

Can't call method "transfer" on an undefined value at \
lib/Spreadsheet/ParseExcel/Stream/XLS.pm line 31.

That comes from the following bit of code:

my ($wb, $idx, $row, $col, $cell);
my $tmp = my $handler = sub {
  ($wb, $idx, $row, $col, $cell) = @_;
  $parser->transfer($main);  # XXX here's where we die
};

my $tmp_p = $parser = Coro::State->new(sub {
  $xls->Parse($file);
  # Flag the generator that we're done
  undef $xls;
  # If we don't transfer back when done parsing,
  # it's an implicit program exit (oops!)
  $parser->transfer($main)
});
weaken($parser);
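
For readers unfamiliar with weak references, here is a minimal sketch of how a weakened variable can become undefined, which is one way a "Can't call method ... on an undefined value" error can arise. It uses only the core Scalar::Util module; the hash reference is a hypothetical stand-in for the Coro::State object, and nothing here is taken from the module's actual code beyond the `weaken` call.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-in for the Coro::State object held in $tmp_p.
my $strong = { name => 'parser' };

# Second reference to the same object, then weakened so that it
# no longer contributes to the reference count.
my $weak = $strong;
weaken($weak);

# The strong reference still exists, so the weak one is usable.
print defined $weak ? "alive\n" : "gone\n";

# Drop the only strong reference: the object is freed and Perl
# sets every weak reference to it to undef.
undef $strong;
print defined $weak ? "alive\n" : "gone\n";
```

In the snippet above, `$tmp_p` plays the role of `$strong`: if it is released while the handler is still being called, `$parser` would be undefined at that point.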

The weaken looked suspicious, so I tried not weakening unless the refcount was greater than 1, but the same problem happens. I instrumented the code to get a stacktrace and got this:

parser is undefined at lib/Spreadsheet/ParseExcel/Stream/XLS.pm line 29.

Spreadsheet::ParseExcel::Stream::XLS::__ANON__                   \
  ('Spreadsheet::ParseExcel::Workbook=HASH(0x6cd4a08)', 0, 2, 1, \
  'Spreadsheet::ParseExcel::Cell=HASH(0x1387ce78)') called at    \
  /usr/share/perl5/Spreadsheet/ParseExcel.pm line 2152
Spreadsheet::ParseExcel::_NewCell(                               \ 
  'Spreadsheet::ParseExcel::Workbook=HASH(0x6cd4a08)', 2, 1,     \
  'Kind', 'PackedIdx', 'Val', 'Dean', 'FormatNo', 25, ...)       \
   called at /usr/share/perl5/Spreadsheet/ParseExcel.pm line 896
Spreadsheet::ParseExcel::_subLabelSST(                           \
  'Spreadsheet::ParseExcel::Workbook=HASH(0x6cd4a08)', 253, 10,  \
  '\x{2}\x{0}\x{1}\x{0}\x{19}\x{0}2\x{0}\x{0}\x{0}')             \
   called at /usr/share/perl5/Spreadsheet/ParseExcel.pm line 292
Spreadsheet::ParseExcel::parse(                                  \
  'Spreadsheet::ParseExcel=HASH(0x6cd1810)', '2013-09-13.xls')   \
   called at lib/Spreadsheet/ParseExcel/Stream/XLS.pm line 35
Spreadsheet::ParseExcel::Stream::XLS::__ANON__                   \
   called at new_importer.pl line 0

That tells me that the parser read the first and second rows, but it dies on the third row for some reason.

I've tried rebuilding Spreadsheet::ParseExcel::Stream and it doesn't appear to have any errors (all tests pass). I've also recompiled Coro (same result).

I'm mystified. Anyone have any ideas?

asked Oct 01 '13 17:10 by Ovid


1 Answer

The problem turned out to be rather strange and looked like this pseudocode:

stream1 = open first excel stream
sheet1  = stream1.sheet // get spreadsheet ready for reading

if in verbose mode:
    stream2 = open second excel stream
    sheet2  = stream2.sheet
    count++ while sheet2.get_row
    say "We have $count records"

The problem manifested if and only if we were in verbose mode. With two streams pointing to the same document, our production code failed, though the same code worked fine on other boxes. We solved the problem by counting the rows and closing that stream before opening the regular stream that reads the document.
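
A minimal sketch of that ordering, assuming the documented Spreadsheet::ParseExcel::Stream interface (`new`, `sheet`, `row`). The `count_rows` helper and its optional `$open` argument are hypothetical names introduced for illustration, not part of the module, and "closing" the stream is approximated here by letting the object go out of scope.

```perl
use strict;
use warnings;

# Count the rows of the first sheet with a short-lived stream.
# $open is an optional code ref that returns a stream object; by
# default it opens the file with Spreadsheet::ParseExcel::Stream.
sub count_rows {
    my ($file, $open) = @_;
    $open ||= sub {
        require Spreadsheet::ParseExcel::Stream;
        return Spreadsheet::ParseExcel::Stream->new($_[0]);
    };

    my $count = 0;
    {
        my $stream = $open->($file);
        my $sheet  = $stream->sheet;
        if ($sheet) {
            $count++ while $sheet->row;
        }
    }   # $stream and $sheet are released here, before any other stream is opened

    return $count;
}
```

A caller would then do the counting first and only afterwards open the stream used for the real import, so that the two never coexist:

```perl
# my $count = count_rows($file);
# print "We have $count records\n" if $verbose;
# my $stream = Spreadsheet::ParseExcel::Stream->new($file);
```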

answered Nov 10 '22 20:11 by Ovid