Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to display pdf file contents as well as its full name in the browser using cgi python script?

I wish to display the full path of the pdf file along with its contents displayed on the browser. My script has an input html, where user will input file name and submit the form. The script will search for the file, if found in the subdirectories will output the file contents into the browser and also display its name. I am able to display the contents but unable to display the full fine name also simultaneously Or if I display the filename I get garbage character display for the contents. Please guide.

enter link description here

script a.py:

import os
import cgi
import cgitb 
cgitb.enable()
import sys
import webbrowser

def check_file_extension(display_file):
    input_file = display_file
    nm,file_extension = os.path.splitext(display_file)
    return file_extension

form = cgi.FieldStorage()

type_of_file =''
file_nm = ''
nm =''
not_found = 3

if form.has_key("file1"):
    file_nm = form["file1"].value

type_of_file = check_file_extension(file_nm)

pdf_paths = [ '/home/nancy/Documents/',]

# Change the path while executing on the server , else it will throw error 500
image_paths = [ '/home/nancy/Documents/']


if type_of_file == '.pdf':
    search_paths = pdf_paths
else:
    # .jpg
    search_paths = image_paths
for path in search_paths:
    for root, dirnames, filenames in os.walk(path):
        for f in filenames:
            if f == str(file_nm).strip():
                absolute_path_of_file = os.path.join(root,f)
                # print 'Content-type: text/html\n\n'
                # print '<html><head></head><body>'
                # print absolute_path_of_file
                # print '</body></html>'
#                 print """Content-type: text/html\n\n
# <html><head>absolute_path_of_file</head><body>
# <img src=file_display.py />
# </body></html>"""
                not_found = 2
                if  search_paths == pdf_paths:
                    print 'Content-type: application/pdf\n'
                else:
                    print 'Content-type: image/jpg\n'
                file_read = file(absolute_path_of_file,'rb').read()
                print file_read
                print 'Content-type: text/html\n\n'
                print absolute_path_of_file
                break
        break
    break

if not_found == 3:
    print  'Content-type: text/html\n'
    print '%s not found' % absolute_path_of_file

The html is a regular html with just 1 input field for file name.

like image 924
user956424 Avatar asked Oct 01 '22 05:10

user956424


1 Answers

It is not possible. At least not that simple. Some web browsers don't display PDFs but ask the user to download the file, some display them themselves, some embed an external PDF viewer component, some start an external PDF viewer. There is no standard, cross browser way to embed PDF into HTML, which would be needed if you want to display arbitrary text and the PDF content.

A fallback solution, working on every browser, would be rendering the PDF pages on the server as images and serve those to the client. This puts some stress on the server (processor, memory/disk for caching, bandwidth).

Some modern, HTML5 capable browsers can render PDFs with Mozilla's pdf.js on a canvas element.

For other's you could try to use <embed>/<object> to use Adobe's plugin as described on Adobe's The PDF Developer Junkie Blog.


Rendering the pages on the server

Rendering and serving the PDF pages as images needs some software on the server to query the number of pages and to extract and render a given page as image.

The number of pages can be determined with the pdfinfo program from Xpdf or the libpoppler command line utilities. Converting a page from the PDF file to a JPG image can be done with convert from the ImageMagick tools. A very simple CGI program using these programs:

#!/usr/bin/env python
import cgi
import cgitb; cgitb.enable()
import os
from itertools import imap
from subprocess import check_output

PDFINFO = '/usr/bin/pdfinfo'
CONVERT = '/usr/bin/convert'
DOC_ROOT = '/home/bj/Documents'

BASE_TEMPLATE = (
    'Content-type: text/html\n\n'
    '<html><head><title>{title}</title></head><body>{body}</body></html>'
)
PDF_PAGE_TEMPLATE = (
    '<h1>{filename}</h1>'
    '<p>{prev_link} {page}/{page_count} {next_link}</p>'
    '<p><img src="{image_url}" style="border: solid thin gray;"></p>'
)

SCRIPT_NAME = os.environ['SCRIPT_NAME']


def create_page_url(filename, page_number, type_):
    return '{0}?file={1}&page={2}&type={3}'.format(
        cgi.escape(SCRIPT_NAME, True),
        cgi.escape(filename, True),
        page_number,
        type_
    )


def create_page_link(text, filename, page_number):
    text = cgi.escape(text)
    if page_number is None:
        return '<span style="color: gray;">{0}</span>'.format(text)
    else:
        return '<a href="{0}">{1}</a>'.format(
            create_page_url(filename, page_number, 'html'), text
        )


def get_page_count(filename):

    def parse_line(line):
        key, _, value = line.partition(':')
        return key, value.strip()

    info = dict(
        imap(parse_line, check_output([PDFINFO, filename]).splitlines())
    )
    return int(info['Pages'])


def get_page(filename, page_index):
    return check_output(
        [
            CONVERT,
            '-density', '96',
            '{0}[{1}]'.format(filename, page_index),
            'jpg:-'
        ]
    )


def send_error(message):
    print BASE_TEMPLATE.format(
        title='Error', body='<h1>Error</h1>{0}'.format(message)
    )


def send_page_html(_pdf_path, filename, page_number, page_count):
    body = PDF_PAGE_TEMPLATE.format(
        filename=cgi.escape(filename),
        page=page_number,
        page_count=page_count,
        image_url=create_page_url(filename, page_number, 'jpg'),
        prev_link=create_page_link(
            '<<', filename, page_number - 1 if page_number > 1 else None
        ),
        next_link=create_page_link(
            '>>',
            filename,
            page_number + 1 if page_number < page_count else None
        )
    )
    print BASE_TEMPLATE.format(title='PDF', body=body)


def send_page_image(pdf_path, _filename, page_number, _page_count):
    image_data = get_page(pdf_path, page_number - 1)
    print 'Content-type: image/jpg'
    print 'Content-Length:', len(image_data)
    print
    print image_data


TYPE2SEND_FUNCTION = {
    'html': send_page_html,
    'jpg': send_page_image,
}


def main():
    form = cgi.FieldStorage()
    filename = form.getfirst('file')
    page_number = int(form.getfirst('page', 1))
    type_ = form.getfirst('type', 'html')

    pdf_path = os.path.abspath(os.path.join(DOC_ROOT, filename))
    if os.path.exists(pdf_path) and pdf_path.startswith(DOC_ROOT):
        page_count = get_page_count(pdf_path)
        page_number = min(max(1, page_number), page_count)
        TYPE2SEND_FUNCTION[type_](pdf_path, filename, page_number, page_count)
    else:
        send_error(
            '<p>PDF file <em>{0!r}</em> not found.</p>'.format(
                cgi.escape(filename)
            )
        )


main()

There is Python bindings for libpoppler, so the call to the external pdfinfo program could be replaced quite easily with that module. It may also be used to extract more information for the pages like links on the PDF pages to create HTML image maps for them. With the libcairo Python bindings installed it may be even possible to do the rendering of a page without an external process.

like image 152
BlackJack Avatar answered Oct 04 '22 18:10

BlackJack