How can I convert PDF files to HTML with Python?
I was thinking of something along the lines of what Google does (or seems to do) to index PDF files.
My final goal is to set up Apache to serve the HTML for the PDF files, so anything leading me in that direction would also be appreciated.
Just replace from pyPdf import ... with from PyPDF2 import ... . Use with open("document-page%s.pdf" % (i+1), "wb") as outputStream: if you want your files to be named with an index starting from 1 instead of 0. What if, when splitting a 100-page PDF, I want to save 2 pages in each output PDF instead of 1 page per file?
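For the two-pages-per-file case asked about above, a minimal sketch might look like the following. It assumes PyPDF2 3.x (PdfReader, PdfWriter, add_page); older releases used PdfFileReader and addPage instead. The function names and the output file pattern are hypothetical, chosen only for illustration.

```python
def chunk_page_indices(num_pages, pages_per_file):
    """Group zero-based page indices into consecutive chunks."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        range(start, min(start + pages_per_file, num_pages))
        for start in range(0, num_pages, pages_per_file)
    ]


def split_pdf(input_path, pages_per_file=2, pattern="document-part%s.pdf"):
    """Write every chunk of pages to its own PDF (file names start at 1)."""
    # Imported here so the chunking helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    reader = PdfReader(input_path)
    chunks = chunk_page_indices(len(reader.pages), pages_per_file)
    for i, chunk in enumerate(chunks):
        writer = PdfWriter()
        for page_index in chunk:
            writer.add_page(reader.pages[page_index])
        with open(pattern % (i + 1), "wb") as output_stream:
            writer.write(output_stream)
```

Calling split_pdf("document.pdf", pages_per_file=2) on a 100-page file should then produce 50 output files; a trailing odd page ends up alone in the last file.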
There are a couple of Python libraries you can use to extract data from PDFs. For example, PyPDF2 can extract text from PDFs in which the text is laid out sequentially or in a regular format, i.e. in lines or forms. You can also extract tables from PDFs with the Camelot library.
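As a rough sketch of how the PyPDF2 route could lead to HTML, the code below extracts the text of each page and wraps it in a bare-bones HTML document. It assumes PyPDF2 3.x (PdfReader and page.extract_text()); the function names and the HTML layout are hypothetical, and the result is plain text only, with none of the original PDF formatting.

```python
import html


def pages_to_html(page_texts, title="Extracted PDF text"):
    """Wrap a list of per-page text strings in a minimal HTML document."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Escape the text so characters like < and & display literally.
        escaped = html.escape(text or "")
        sections.append(
            '<section id="page-%d"><pre>%s</pre></section>' % (number, escaped)
        )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>%s</title></head>\n'
        "<body>\n%s\n</body></html>\n" % (html.escape(title), "\n".join(sections))
    )


def extract_page_texts(pdf_path):
    """Return the extracted text of each page, in page order."""
    # Imported here so pages_to_html works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

Something like pages_to_html(extract_page_texts("document.pdf")) gives a string you can write to a .html file next to the PDF for Apache to serve. Scanned PDFs without a text layer will come out empty with this approach.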
The poppler package provides a pdftohtml utility that you might be able to use. There are also Python bindings to libpoppler.
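If the poppler command-line tools are installed, one way to drive the converter from Python is through subprocess. This is only a sketch: it assumes the executable is named pdftohtml and is on your PATH, and it uses the -noframes option to ask for a single HTML document rather than a frameset. The function names are hypothetical.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, output_base, executable="pdftohtml"):
    """Build the argument list for converting one PDF to one HTML document."""
    # -noframes: produce a single HTML document instead of a frameset.
    return [executable, "-noframes", str(pdf_path), str(output_base)]


def convert_pdf_to_html(pdf_path, output_base):
    """Run the converter, raising if it is missing or exits with an error."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH (is poppler installed?)")
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
```

Running convert_pdf_to_html("document.pdf", "document") should leave the HTML output next to the PDF, which you can then point Apache at. Check the exact output file names your poppler version produces before wiring this into a server.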