
Translate PDF file using Google Translate API

I want to use Google Translate in my project. I have completed all the formalities with Google and I have an API key. With this key I can easily translate any word from JavaScript. But how can I translate a PDF file, as the Google Translate site does? I found something like this:

http://translate.google.com/translate?hl=fr&sl=auto&tl=en&u=http://www.example.com/PDF.pdf

But I cannot use my key here, and as a result the translation takes a long time. So I want to use my key to translate a PDF file. My approach is like this:

1. I have an HTML page.
2. It has a browse button for a PDF.
3. The user uploads the file.
4. Translate the PDF with the Google API and show the result in the HTML page.

I searched for a way to translate a PDF with the API but did not find anything. Please help me out.

asked May 14 '15 04:05 by Saikat



1 Answer

TL;DR: Use a headless browser to render a PDF from Google's PDF translation service.

PDF is a complex format, and text can appear in many of its components. I will describe the possible solutions, from the easiest to the most advanced.

Translate raw text

If you only need the translation without the visual output, you can extract the text and give it to Google Translate.

Since you did not provide information on your project (language, environment, ...), I will refer you to this thread on how to extract text.
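
Once the text is extracted, sending it to the API with your key could look like the sketch below. It assumes the Cloud Translation v2 REST endpoint with its `key`, `q`, `target` and `format` parameters. The helper names and the chunk size passed to `chunkText` are illustrative choices, not something taken from the question.

```javascript
// Assumed endpoint: Google Cloud Translation API v2 (REST).
var TRANSLATE_ENDPOINT = 'https://translation.googleapis.com/language/translate/v2';

// Split extracted text into chunks of at most maxLen characters,
// breaking on blank lines (paragraph boundaries) where possible.
function chunkText(text, maxLen) {
    var chunks = [];
    var current = '';
    text.split(/\n\s*\n/).forEach(function (p) {
        p = p.trim();
        if (!p) { return; }
        // A single paragraph longer than maxLen is cut into fixed-size pieces.
        while (p.length > maxLen) {
            if (current) { chunks.push(current); current = ''; }
            chunks.push(p.slice(0, maxLen));
            p = p.slice(maxLen);
        }
        if (current && current.length + 2 + p.length > maxLen) {
            chunks.push(current);
            current = p;
        } else {
            current = current ? current + '\n\n' + p : p;
        }
    });
    if (current) { chunks.push(current); }
    return chunks;
}

// Describe the HTTP request for one chunk (hypothetical helper).
function buildTranslateRequest(chunk, targetLang, apiKey) {
    return {
        url: TRANSLATE_ENDPOINT + '?key=' + encodeURIComponent(apiKey),
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ q: chunk, target: targetLang, format: 'text' })
    };
}

// Pull the translated strings out of a parsed v2 response body.
function parseTranslateResponse(json) {
    return json.data.translations.map(function (t) { return t.translatedText; });
}
```

Each request object can be sent with whatever HTTP client your project uses. Joining the parsed results with blank lines gives back the translated text, without the original layout.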

Translate all text

If you need to get the text from everything in your PDF, that's pretty hard. To (partially) avoid the headache, you can convert the PDF to images (using ImageMagick or similar tools), and then you have three options:

  • OCR the text from the image, then give it to Google. Again, you are losing the original layout.
  • OCR the text while saving its position (some libraries can do that; again, since you did not specify your project information, see these links: #1, #2, #3, #4).

    Then translate it with the Google API and write the result onto the image. For good results you need to take the font, text color and background color into account. Pretty difficult, but feasible.

  • Translate the image using Google Translate's image service. Unfortunately, this feature is not available in the public API, so short of reverse engineering, this is not possible.
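
For the second option, one step that is easy to overlook is grouping the OCR output into lines, so that each line is translated as a unit and drawn back at its original position. The sketch below assumes a hypothetical word format `{ text, x, y, w, h }` (bounding box in pixels) and a simple single-column page; it is not the output format of any particular OCR library.

```javascript
// Group OCR words into text lines, keeping a bounding box per line.
// Input shape is hypothetical: { text: 'Hello', x: 10, y: 12, w: 40, h: 14 }
function groupWordsIntoLines(words) {
    // Process words top to bottom, then left to right.
    var sorted = words.slice().sort(function (a, b) {
        return a.y - b.y || a.x - b.x;
    });
    var lines = [];
    sorted.forEach(function (word) {
        var line = lines[lines.length - 1];
        var centerY = word.y + word.h / 2;
        // A word joins the current line when its vertical center
        // falls inside that line's bounding box.
        if (line && centerY >= line.y && centerY <= line.y + line.h) {
            var right = Math.max(line.x + line.w, word.x + word.w);
            var bottom = Math.max(line.y + line.h, word.y + word.h);
            line.words.push(word);
            line.x = Math.min(line.x, word.x);
            line.y = Math.min(line.y, word.y);
            line.w = right - line.x;
            line.h = bottom - line.y;
        } else {
            lines.push({ words: [word], x: word.x, y: word.y, w: word.w, h: word.h });
        }
    });
    return lines.map(function (line) {
        // Restore reading order inside the line before joining the words.
        line.words.sort(function (a, b) { return a.x - b.x; });
        return {
            text: line.words.map(function (w) { return w.text; }).join(' '),
            x: line.x, y: line.y, w: line.w, h: line.h
        };
    });
}
```

Each resulting `text` can be sent for translation, and the `x`, `y`, `w`, `h` box says where to paint over the original and write the translated line. Multi-column layouts would need a smarter grouping than this.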

Translate using Google's PDF translation service

The solution you found, using the translate site, can be automated quite easily. The reason it takes long is that it is a heavy process, and you probably won't beat Google.

Using a headless browser, you can load the translation page for your PDF, observe that the translated content sits in an iframe, open that iframe and finally print it to PDF.

Here is a short example using SlimerJS (it should be compatible with PhantomJS):

var page = require("webpage").create();

// here you may want to set up the page size and other options

// open the Google Translate page for the PDF
page.open('https://translate.google.fr/translate?hl=fr&sl=en&u=http://example.com/pdf-sample.pdf', function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        // exit explicitly, otherwise the script keeps running
        phantom.exit(1);
        return;
    }

    // the translated content sits in an iframe: find its URL with querySelector
    var iframe_src = page.evaluate(function() {
        return document.querySelector('#contentframe').querySelector('iframe').src;
    });

    console.log('Found iframe: ' + iframe_src);

    // open the iframe on its own so it can be rendered
    page.open(iframe_src, function(status) {
        if (status !== 'success') {
            console.log('Unable to load the translated frame');
            phantom.exit(1);
            return;
        }

        // wait a bit for javascript to translate
        // this can be optimized to be triggered in javascript when translation is done
        setTimeout(function() {
            // print the page into PDF
            page.render('/tmp/test.pdf', { format: 'pdf' });

            phantom.exit(0);
        }, 2000);
    });
});

Given this file: http://www.cbu.edu.zm/downloads/pdf-sample.pdf
it produces this result (translated into French). (I posted a screenshot since I cannot embed a PDF ;) )
[Screenshot: PDF result]
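
If the PDF address comes from a user upload rather than being hard-coded, the translate URL can be assembled from its parts. This is a minimal sketch: the `hl`, `sl`, `tl` and `u` parameters are the ones that appear in the URLs in this thread, while the function name and the choice to percent-encode `u` are my own assumptions.

```javascript
// Build the Google Translate page URL for a given PDF (hypothetical helper).
// uiLang     -> hl (interface language)
// sourceLang -> sl ('auto' lets Google detect the language)
// targetLang -> tl
function buildTranslatePageUrl(pdfUrl, sourceLang, targetLang, uiLang) {
    var params = [
        'hl=' + encodeURIComponent(uiLang),
        'sl=' + encodeURIComponent(sourceLang),
        'tl=' + encodeURIComponent(targetLang),
        // Encode the PDF URL so that its own '?' and '&' do not
        // get mixed up with the translate page's parameters.
        'u=' + encodeURIComponent(pdfUrl)
    ];
    return 'https://translate.google.com/translate?' + params.join('&');
}
```

The returned string can be passed as the first argument of `page.open` in place of the literal URL.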

answered Oct 11 '22 15:10 by Cyrbil
