i run a job search site, and i need to convert doc, docx and pdf files into HTML on linux CentOS server running php. People submit these files as resumes. So far, I found PHPDocx to be great at converting docx to html. But I am stuck at doc/pdf. PDFTOHTML gives error "bad color" when i run tests. As far as doc, i only found wvwave, which seems complex and bulky to install.
does anyone have any ideas on how to easily convert doc/pdf to HTML?
The only thing i can think of is FPDF. It is intended for creating PDF files in PHP but it can also open PDF files. Maybe you can use that as a base and develop some sort of toHTML function for it.
It is completely free to use and it has some extensions already. It MIGHT help you.
http://www.fpdf.org
EDIT: Thanks for the addition to my post in the comments to Pierre:
You can use fpdi: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input pdf is just like an image.
I havent taken a look at it myself so far but this might help.
As far as .doc files go how about trying OpenOffice/LibreOffice, something like:
lowriter -convert-to html doc_file.doc –
As far as PDF goes, if the PDF is a graphical representation of text then you're out of luck, best you can do is try convert it to an image with ImageMagick, if it is a proper text it should easily convert.
There are various tools out there already to do this, such as http://dag.wieers.com/home-made/unoconv/, http://www.phpdocx.com/ (which you've already tried)
http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.
Or, you could install a portable version of libreoffice on your server which allows command line conversion https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
I'm sure there'll be tutorials out there (on libreoffice support area)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With