How to implement a PDF viewer that loads pages asynchronously

Tags: android, ios, pdf

We need to allow users of our mobile app to browse a magazine with an experience that is fast, fluid and feels native to the platform (similar to iBooks/Google Books).

Some features we need are thumbnails of the whole magazine and searching for specific text.

The problem is that our magazines are over 140 pages long and we can’t force our users to have to fully download the whole ebook/PDF beforehand. We need pages to be loaded asynchronously, that is, to let users start reading without having to fully download the content.

I studied PDFKit for iOS; however, I didn't find any mention in the documentation of downloading a PDF asynchronously.

Are there any solutions/libraries to implement this functionality on iOS and Android?

lisovaccaro asked May 06 '18 02:05


1 Answer

What you're looking for is called linearization. According to this answer:

The first object immediately after the %PDF-1.x header line shall contain a dictionary key indicating the /Linearized property of the file.

This overall structure allows a conforming reader to learn the complete list of object addresses very quickly, without needing to download the complete file from beginning to end:

  • The viewer can display the first page(s) very fast, before the complete file is downloaded.

  • The user can click on a thumbnail page preview (or a link in the ToC of the file) in order to jump to, say, page 445, immediately after the first page(s) have been displayed, and the viewer can then request all the objects required for page 445 by asking the remote server via byte range requests to deliver these "out of order" so the viewer can display this page faster. (While the user reads pages out of order, the downloading of the complete document will still go on in the background...)
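As a rough illustration of the two mechanisms quoted above, the sketch below checks whether the start of a file advertises linearization and builds the Range header a viewer would send to fetch a span of bytes out of order. The function names are hypothetical, and the 1024-byte window assumes the PDF specification's rule that the linearization dictionary sits within the first 1024 bytes of the file. A real viewer would parse that dictionary and its hint tables rather than search for a substring.

```python
def looks_linearized(head: bytes) -> bool:
    """Return True if the first bytes of a PDF mention /Linearized.

    `head` is the beginning of the file. This is a hypothetical helper
    that does a substring check, not a PDF parser.
    """
    return head.startswith(b"%PDF-") and b"/Linearized" in head[:1024]


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the -1.
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}


if __name__ == "__main__":
    # Made-up file header, for illustration only.
    sample = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 53540 >>\nendobj\n"
    print(looks_linearized(sample))
    print(range_header(1024, 4096))
```

The server must support byte range requests for this to help; a server that ignores the Range header simply returns the whole file.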

You can use this native library to linearize a PDF.

However, I wouldn't recommend it, as rendering the PDFs won't be fast, fluid or feel native. For those reasons, as far as I know there is no native mobile app that does linearization. Moreover, you would have to create your own rendering engine for the PDF, as most PDF viewing libraries do not support linearization. What you should do instead is convert each individual page of the PDF to HTML on the server and have the client load pages only when required and cache them. We will also save the PDF's plain text separately in order to enable search. This way everything will be smooth, as the resources will be lazy loaded. To achieve this, you can do the following.

First, on the server end, whenever you publish a PDF, its pages should be split into HTML files as explained above. Page thumbnails should also be generated from those pages. Assuming that your server runs Python with the Flask microframework, this is what you do.

import io
import os
import sqlite3
import subprocess

import imgkit
from flask import Flask, jsonify, request, send_file
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from PIL import Image
from pypdf import PdfReader, PdfWriter  # maintained successor to pyPdf
from werkzeug.utils import secure_filename

app = Flask(__name__)


@app.route('/publish', methods=['POST'])
def upload_file():
    f = request.files['file']
    filePath = "pdfs/" + secure_filename(f.filename)  # the pdfs/ directory must already exist
    f.save(filePath)
    savePdfText(filePath)
    inputpdf = PdfReader(filePath)

    # Note: page files are numbered from 0, while pagesTextTables.pageNum starts at 1.
    for i, page in enumerate(inputpdf.pages):
        output = PdfWriter()
        output.add_page(page)
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        # Convert only after the single-page PDF has been closed and flushed.
        # pdf2htmlEX writes document-page<i>.html to the working directory.
        subprocess.run(["pdf2htmlEX", "--zoom", "1.3", "document-page%s.pdf" % i], check=True)
        # imgkit renders HTML (not PDF), so it is given the converted page.
        imgkit.from_file("document-page%s.html" % i, "document-page%s.jpg" % i)
        saveThum("document-page%s.jpg" % i)
    return "published"


def saveThum(infile):
    size = 124, 124
    outfile = os.path.splitext(infile)[0] + ".thumbnail"
    if infile != outfile:
        try:
            im = Image.open(infile)
            im.thumbnail(size, Image.LANCZOS)
            im.save(outfile, "JPEG")
        except IOError:
            print("cannot create thumbnail for '%s'" % infile)


def savePdfText(pdfPath):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    db = sqlite3.connect("pdfText.db")
    cursor = db.cursor()
    cursor.execute('create table if not exists pagesTextTables(id INTEGER PRIMARY KEY,pageNum TEXT,pageText TEXT)')
    pageNum = 1
    # Process each page contained in the document.
    with open(pdfPath, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            data = retstr.getvalue()
            # Reset the buffer so the next row holds only the next page's text.
            retstr.seek(0)
            retstr.truncate(0)
            cursor.execute('INSERT INTO pagesTextTables(pageNum,pageText) values(?,?)', (str(pageNum), data))
            pageNum = pageNum + 1
    db.commit()
    db.close()


# request.values reads the parameter from the query string or from form data.
@app.route('/page', methods=['GET', 'POST'])
def getPage():
    page_num = int(request.values['page_num'])  # int() rejects anything that is not a number
    return send_file(os.path.abspath("document-page%s.html" % page_num), as_attachment=True)


@app.route('/thumb', methods=['GET', 'POST'])
def getThum():
    page_num = int(request.values['page_num'])
    return send_file(os.path.abspath("document-page%s.thumbnail" % page_num),
                     mimetype="image/jpeg", as_attachment=True)


@app.route('/search', methods=['GET', 'POST'])
def search():
    query = request.values['query']
    db = sqlite3.connect("pdfText.db")
    cursor = db.cursor()
    # Bind the search term as a parameter instead of concatenating it into the SQL.
    cursor.execute("SELECT pageNum FROM pagesTextTables WHERE pageText LIKE ?", ('%' + query + '%',))
    result = [row[0] for row in cursor.fetchall()]
    db.close()
    # Return the matching page numbers in the response body as JSON.
    return jsonify(queryResults=result)

Here is an explanation of what the Flask app is doing.

  1. The /publish route is responsible for publishing your magazine: turning every page into HTML, saving the PDF's text to an SQLite db and generating thumbnails for those pages. I've used pyPdf to split the PDF into individual pages, pdf2htmlEX to convert the pages to HTML, imgkit to render those HTML pages as images and PIL to generate thumbs from those images. Also, a simple SQLite db saves the pages' text.
  2. The /page, /thumb and /search routes are self-explanatory. They simply return the HTML, thumb or search query results.
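To make the search step concrete, here is a minimal sketch of the lookup on its own, run against an in-memory SQLite database with the same pagesTextTables schema. The search_pages helper and the sample rows are made up for illustration. It binds the term as a parameter and escapes LIKE wildcards, which is one reasonable way to keep user input out of the SQL text, not necessarily what the answer's author intended.

```python
import sqlite3


def search_pages(db: sqlite3.Connection, query: str) -> list:
    """Return the page numbers whose stored text contains `query`."""
    # Escape LIKE wildcards so a literal % or _ in the query is matched as text.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = db.execute(
        "SELECT pageNum FROM pagesTextTables "
        "WHERE pageText LIKE ? ESCAPE '\\' ORDER BY id",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("create table pagesTextTables(id INTEGER PRIMARY KEY,pageNum TEXT,pageText TEXT)")
    # Made-up page text, for illustration only.
    db.executemany(
        "INSERT INTO pagesTextTables(pageNum,pageText) values(?,?)",
        [("1", "Editor's letter"), ("2", "Travel: Lisbon in 48 hours")],
    )
    print(search_pages(db, "lisbon"))
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so the client does not need to normalize case before sending the query.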

Secondly, on the client end you simply download the HTML page whenever the user scrolls to it. Let me give you an example for Android. First, you'd want to create some utils to handle the GET requests:

// One shared client: OkHttp reuses its connection pool across calls.
private static final OkHttpClient client = new OkHttpClient.Builder()
        .connectTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(30, TimeUnit.SECONDS)
        .readTimeout(30, TimeUnit.SECONDS)
        .build();

// These calls block, so run them off the main thread on Android.
public static byte[] GetPage(int mPageNum) throws IOException {
    return CallServer("page", "page_num", Integer.toString(mPageNum));
}

public static byte[] GetThum(int mPageNum) throws IOException {
    return CallServer("thumb", "page_num", Integer.toString(mPageNum));
}

private static byte[] CallServer(String route, String requestName, String requestValue) throws IOException {
    // "yourUrl/" is a placeholder for your server's base URL.
    HttpUrl url = HttpUrl.parse("yourUrl/" + route).newBuilder()
            .addQueryParameter(requestName, requestValue)
            .build();
    Request request = new Request.Builder().url(url).build(); // GET is the default method
    // Closing the response releases the connection back to the pool.
    try (Response response = client.newCall(request).execute()) {
        if (!response.isSuccessful()) {
            throw new IOException("Unexpected response " + response.code());
        }
        return response.body().bytes();
    }
}

The helper utils above simply handle the queries to the server for you; they should be self-explanatory. Next, you simply create a RecyclerView with a WebView ViewHolder, or better yet an AdvancedWebView, as it will give you more power with customization.

    public static class ViewHolder extends RecyclerView.ViewHolder {
        private AdvancedWebView mWebView;

        public ViewHolder(View itemView) {
            super(itemView);
            mWebView = (AdvancedWebView) itemView;
        }
    }

    private class ContentAdapter extends RecyclerView.Adapter<YourFragment.ViewHolder> {
        @Override
        public ViewHolder onCreateViewHolder(ViewGroup container, int viewType) {
            return new ViewHolder(new AdvancedWebView(container.getContext()));
        }

        @Override
        public int getItemViewType(int position) {
            return 0;
        }

        @Override
        public void onBindViewHolder(ViewHolder holder, int position) {
            // Pass the position so the handler knows which page to request.
            handlePageDownload(holder.mWebView, position);
        }

        private void handlePageDownload(AdvancedWebView mWebView, int position){....}

        @Override
        public int getItemCount() {
            return numberOfPages;
        }
    }
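The download step is left elided in the adapter, so what follows is only a language-neutral sketch of the "load when required and cache" idea described earlier, written in Python rather than Java to match the server code. PageCache, its fetch argument and the capacity of 10 are hypothetical names and values. The sketch keeps the most recently viewed pages in memory and calls the fetch function only on a cache miss.

```python
from collections import OrderedDict
from typing import Callable


class PageCache:
    """Keep the most recently viewed pages in memory; fetch others on demand."""

    def __init__(self, fetch: Callable[[int], bytes], capacity: int = 10):
        self._fetch = fetch          # e.g. a function that downloads one page
        self._capacity = capacity
        self._pages: "OrderedDict[int, bytes]" = OrderedDict()

    def get(self, page_num: int) -> bytes:
        if page_num in self._pages:
            self._pages.move_to_end(page_num)   # mark as most recently used
            return self._pages[page_num]
        data = self._fetch(page_num)            # only runs on a cache miss
        self._pages[page_num] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)     # evict the least recently used
        return data


if __name__ == "__main__":
    # Stand-in for a real download; it just fabricates page bytes.
    cache = PageCache(lambda n: b"page %d" % n, capacity=2)
    print(cache.get(1))
```

Evicting the least recently used page bounds memory use on a 140-page magazine while keeping back-and-forth scrolling between neighbouring pages free of extra requests.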

That should be about it.

Niza Siwale answered Oct 18 '22 08:10