
phantomjs pdf to stdout

I am desperately trying to output a PDF generated by PhantomJS to stdout like here

What I am getting is an empty PDF file: it is not 0 bytes in size, but it displays a blank page.

var page = require('webpage').create(),
    system = require('system'),
    address;

address = system.args[1];
page.paperSize = {format: 'A4'};

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        // Exit with a non-zero code so callers can detect the failure.
        phantom.exit(1);
    } else {
        window.setTimeout(function () {
            page.render('/dev/stdout', { format: 'pdf' });
            phantom.exit();
        }, 1000);
    }
});

And I call it like so: phantomjs rasterize.js http://google.com > test.pdf

I tried changing /dev/stdout to system.stdout, but no luck. Writing the PDF straight to a file works without any problems.

I am looking for a cross-platform implementation, so I hope this is achievable on non-Linux systems.

michaeltintiuc asked Oct 22 '13 08:10


2 Answers

When writing output to /dev/stdout or /dev/stderr on Windows, PhantomJS goes through the following steps (as seen in the render method in \phantomjs\src\webpage.cpp):

  1. In the absence of /dev/stdout and /dev/stderr, a temporary file path is allocated.
  2. Call renderPdf with the temporary file path.
  3. Render the web page to this file path.
  4. Read the contents of this file into a QByteArray.
  5. Call QString::fromAscii on the byte array and write to stdout or stderr.
  6. Delete the temporary file.

To begin with, I built the source for PhantomJS, but commented out the file deletion. On the next run, I was able to examine the temporary file it had rendered, which turned out to be completely fine. I also tried running phantomjs.exe rasterize.js http://google.com > test.png with the same results. This immediately ruled out a rendering issue, or anything specifically to do with PDFs, meaning that the problem had to be related to the way data is written to stdout.

By this stage I suspected that some text encoding shenanigans were going on. From previous runs, I had both a valid and an invalid version of the same file (a PNG in this case).

Using some C# code, I ran the following experiment:

using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

//Read the contents of the known good file.
byte[] bytesFromGoodFile = File.ReadAllBytes("valid_file.png");
//Read the contents of the known bad file.
byte[] bytesFromBadFile = File.ReadAllBytes("invalid_file.png");

//Take the bytes from the valid file and convert to a string
//using the Latin-1 encoding.
string iso88591String = Encoding.GetEncoding("iso-8859-1").GetString(bytesFromGoodFile);
//Take the Latin-1 encoded string and retrieve its bytes using the UTF-8 encoding.
byte[] bytesFromIso88591String = Encoding.UTF8.GetBytes(iso88591String);

//If the bytes from the Latin-1 string are all the same as the ones from the
//known bad file, we have an encoding problem.
//SequenceEqual also compares the lengths, so a truncated or longer file fails
//the check instead of passing silently or throwing an index error.
Debug.Assert(bytesFromBadFile.SequenceEqual(bytesFromIso88591String));

Note that I used the ISO-8859-1 encoding because Qt uses it as the default encoding for C strings. As it turned out, all those bytes were the same. The point of that exercise was to see if I could mimic the encoding steps that caused valid data to become invalid.

For further evidence, I investigated \phantomjs\src\system.cpp and \phantomjs\src\filesystem.cpp.

  • In system.cpp, the System class holds references to, among other things, File objects for stdout, stdin and stderr, which are set up to use UTF-8 encoding.
  • When writing to stdout, the write function of the File object is called. This function supports writing to both text and binary files, but because of the way the System class initializes them, all writing will be treated as though it were going to a text file.

So the problem boils down to this: we need to be performing a binary write to stdout, yet our writes end up being treated as text and having an encoding applied to them that causes the resulting file to be invalid.
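
The effect can be reproduced outside PhantomJS. The following is a minimal Node.js sketch, not PhantomJS code: the function names writeAsUtf8Text and undoUtf8Text are made up for illustration, and it assumes the damage is exactly the Latin-1 to UTF-8 round trip that the C# experiment confirmed. Under that assumption the damage is also reversible, which may help salvage a file that has already been written.

```javascript
// Model of the suspected failure: binary data is turned into a string
// (one character per byte, Latin-1) and then written out as UTF-8 text.
function writeAsUtf8Text(bytes) {
    var text = bytes.toString('latin1');   // assumed equivalent of the string conversion
    return Buffer.from(text, 'utf8');      // what a UTF-8 text stream would emit
}

// Reverse of the above: decode the UTF-8, then keep one byte per character.
function undoUtf8Text(damaged) {
    return Buffer.from(damaged.toString('utf8'), 'latin1');
}

// First eight bytes of a PNG file; 0x89 lies outside the ASCII range.
var original = Buffer.from([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]);
var damaged = writeAsUtf8Text(original);
var repaired = undoUtf8Text(damaged);

console.log(original.toString('hex')); // 89504e470d0a1a0a
console.log(damaged.toString('hex'));  // c289504e470d0a1a0a
console.log(repaired.equals(original)); // true
```

Every byte at or above 0x80 becomes two bytes in the output, which is why the resulting file is not empty but no longer valid.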


Given the problem described above, I can't see any way to get this working the way you want on Windows without making changes to the PhantomJS code. So here they are:

This first change will provide a function we can call on File objects to explicitly perform a binary write.

Add the following function prototype in \phantomjs\src\filesystem.h:

bool binaryWrite(const QString &data);

And place its definition in \phantomjs\src\filesystem.cpp (the code for this method comes from the write method in this file):

bool File::binaryWrite(const QString &data)
{
    if ( !m_file->isWritable() ) {
        qDebug() << "File::binaryWrite - " << "Couldn't write:" << m_file->fileName();
        return false;
    }

    // Copy one byte per character so that no text encoding is applied.
    QByteArray bytes(data.size(), Qt::Uninitialized);
    for(int i = 0; i < data.size(); ++i) {
        bytes[i] = data.at(i).toAscii();
    }
    // QIODevice::write returns -1 on error.
    return m_file->write(bytes) != -1;
}

At around line 920 of \phantomjs\src\webpage.cpp you'll see a block of code that looks like this:

    if( fileName == STDOUT_FILENAME ){
#ifdef Q_OS_WIN32
        _setmode(_fileno(stdout), O_BINARY);            
#endif      

        ((File *)system->_stderr())->write(QString::fromAscii(name.constData(), name.size()));

#ifdef Q_OS_WIN32
        _setmode(_fileno(stdout), O_TEXT);
#endif          
    }

Change it to this:

    if( fileName == STDOUT_FILENAME ){
#ifdef Q_OS_WIN32
        _setmode(_fileno(stdout), O_BINARY);
        ((File *)system->_stdout())->binaryWrite(QString::fromAscii(ba.constData(), ba.size()));
        _setmode(_fileno(stdout), O_TEXT);
#else
        ((File *)system->_stderr())->write(QString::fromAscii(name.constData(), name.size()));
#endif
    }

What that replacement does is call our new binaryWrite function, guarded by an #ifdef Q_OS_WIN32 block. I did it this way to preserve the old functionality on non-Windows systems, which don't seem to exhibit this problem (or do they?). Note that this fix only applies to writing to stdout. You could apply it to stderr as well, but it may not matter quite so much in that case.
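
To illustrate why the per-character loop in binaryWrite avoids the corruption, here is a small Node.js sketch. It is an illustration only: binaryWriteBytes is a hypothetical JavaScript counterpart of the C++ loop, it masks each character to its low 8 bits (a simplification of what toAscii does for characters above 0xFF), and the sample bytes are arbitrary.

```javascript
// Hypothetical counterpart of the binaryWrite loop: every character of the
// string contributes exactly one byte, so no text encoding is applied.
function binaryWriteBytes(data) {
    var bytes = Buffer.alloc(data.length);
    for (var i = 0; i < data.length; ++i) {
        bytes[i] = data.charCodeAt(i) & 0xFF;
    }
    return bytes;
}

// Sample data: '%' followed by four bytes above the ASCII range.
var sample = String.fromCharCode(0x25, 0xE2, 0xE3, 0xCF, 0xD3);
var binaryBytes = binaryWriteBytes(sample);
var textBytes = Buffer.from(sample, 'utf8');

console.log(binaryBytes.length); // 5
console.log(textBytes.length);   // 9
```

The binary path keeps the byte count equal to the string length, while the UTF-8 text path expands each of the four high bytes to two.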

In case you just want a pre-built binary (who wouldn't?), you can find phantomjs.exe with these fixes on my SkyDrive. My version is around 19MB whereas the one I downloaded earlier was only about 6MB, though I followed the instructions here, so it should be fine.

nick_w answered Nov 06 '22 13:11


Yes, that's right: ISO-8859-1 is the default encoding for Qt, so you need to add the parameter --output-encoding=ISO-8859-1 to the command line so that the PDF output won't be corrupted.

i.e.

phantomjs.exe --output-encoding=ISO-8859-1 rasterize.js < input.html > output.pdf

and rasterize.js looks like this (tested, works for both Unix and Windows)

var page = require('webpage').create(),
    system = require('system');

page.viewportSize = {width: 600, height: 600};
page.paperSize = {format: 'A4', orientation: system.args[1], margin: '1cm'};

page.content = system.stdin.read();

window.setTimeout(function () {
    try {
        page.render('/dev/stdout', {format: 'pdf'});
    }
    catch (e) {
        // Log to stderr so the message does not end up inside the PDF on stdout.
        system.stderr.writeLine(e.message);
    }
    phantom.exit();
}, 1000);

Alternatively, you can set the encoding on stdout from within the script. If you are reading from a UTF-8 stream, you might have to set the encoding for stdin as well:

system.stdout.setEncoding('ISO-8859-1');
system.stdin.setEncoding('UTF-8');
page.content = system.stdin.read();

Pinchy answered Nov 06 '22 12:11