Converting PDF to HTML with Python [duplicate]

How can I convert PDF files to HTML with Python?

I was thinking something along the lines of what Google does (or seems to do) to index PDF files.

My final goal is to set up Apache to show the HTML for the PDF files, so anything leading me in that direction would also be appreciated.

Marcos Lara asked Nov 09 '08 20:11



1 Answer

The poppler package provides a pdftohtml command-line utility that you might be able to use. There is also a Python binding to libpoppler.
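For what it's worth, here is a minimal sketch of driving that utility from Python. It assumes the poppler command-line tool is installed and on PATH under the name `pdftohtml` (on Debian/Ubuntu it typically ships in the poppler-utils package). The function names `build_pdftohtml_command` and `convert_pdf_to_html` and the output layout are made up for illustration, and the exact names of the files the tool writes may vary between poppler versions.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_stem):
    """Return the argv list for a pdftohtml call (nothing is executed here)."""
    # -noframes asks pdftohtml for a single HTML document instead of a frameset.
    # out_stem is the output base name; pdftohtml chooses the final file names.
    return ["pdftohtml", "-noframes", str(pdf_path), str(out_stem)]


def convert_pdf_to_html(pdf_path, out_dir):
    """Convert one PDF with pdftohtml and return the output base path."""
    # Fail early with a readable message if the tool is not installed.
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH")
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_stem = out_dir / pdf_path.stem
    # check=True raises CalledProcessError if pdftohtml exits with an error.
    subprocess.run(build_pdftohtml_command(pdf_path, out_stem), check=True)
    return out_stem
```

Calling `convert_pdf_to_html("report.pdf", "html")` should leave the generated HTML under `html/`, which Apache could then serve as static files. The file and directory names here are placeholders.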

Martin v. Löwis answered Oct 15 '22 19:10