From MS Word or Libre Office to clean HTML

People who send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to keep only basic formatting (headings, lists, and emphasis), and no images.

When I convert them with LibreOffice's "Save as HTML", the resulting files are huge: for example, a 112K .doc file becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).

I tried this script, based on tidy and sed: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708. It reduced the size to about 150K, but many useless SPANs remain.
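The span-stripping step that such a script performs with sed can be sketched in Python with regular expressions (this is my own rough approximation of the idea, not the linked script's actual code; regexes are not a real HTML parser, so this is only safe on tag soup you control):

```python
import re

def strip_spans(html: str) -> str:
    """Remove <span>/<font> wrappers (with any attributes) while
    keeping their inner text, similar in spirit to a sed pass."""
    # Drop opening tags like <span class="..."> and <font ...>
    html = re.sub(r"<(span|font)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop the matching closing tags
    html = re.sub(r"</(span|font)\s*>", "", html, flags=re.IGNORECASE)
    return html

print(strip_spans('<p><span lang="he"><font face="Arial">Hello</font></span>!</p>'))
# -> <p>Hello!</p>
```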

I tried to copy and paste into KompoZer, an HTML editor, and then save as HTML; but it converted all my non-Latin (Hebrew) letters to entities such as "&#1460;", which increased the size to 750K!

I tried docvert: https://github.com/holloway/docvert/issues/6 but found that it requires a Python library that requires other libraries, and so on, which seems like an endless trail of dependencies...

Is there a simple way to create clean HTML from Office documents?

asked Jan 24 '13 by Erel Segal-Halevi

2 Answers

I was using http://word2cleanhtml.com/ until I realized that MS Word itself offers the option to save a document as HTML.

On selecting this, the .docx file is saved as .html, and it is the best HTML version of a Word document that I've seen. It's certainly better than all these online tools.

answered Oct 01 '22 by Tarun


I realize this question is old, but the other answers never really answered it. If you are not averse to writing some PHP code, the CubicleSoft Ultimate Web Scraper Toolkit has a class called TagFilter:

https://github.com/cubiclesoft/ultimate-web-scraper/blob/master/support/tag_filter.php

You pass in two things: An array of options and the data to parse as HTML.

For cleaning up broken HTML, the default options from TagFilter::GetHTMLOptions() are a good starting point. Those options form the basis of valid HTML content and, with nothing else done, will clean up any input data into something that another tool like Simple HTML DOM can correctly parse into a DOM model.

However, the other way to use the class is to modify the default options and add a 'callback' option to the options array. For every tag in the HTML, the specified callback function is called, and it returns what to do with that tag; this is where the real power of TagFilter comes into play. You can keep any given tag and some or all of its attributes (or modify them), drop the tag but keep its interior content, keep the tag but drop the content, modify the content (for closing tags), or drop both the tag and the interior content. This approach allows extremely fine-grained control over the most convoluted HTML out there and processes the input in a single pass. See the same repository's test suite for example usage of TagFilter.
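TagFilter's actual PHP callback signature is not reproduced here, but the core decision it describes (keep the tag, or drop the tag and keep only its interior content) can be sketched with Python's stdlib html.parser. The class name and the tag whitelist below are my own illustration, not part of the library:

```python
from html.parser import HTMLParser

# Tags to keep for "basic formatting": headings, lists, paragraphs, emphasis.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li", "p",
        "em", "strong", "i", "b"}

class WhitelistFilter(HTMLParser):
    """For each tag, decide (as a TagFilter callback would) whether to
    keep the tag or drop it while keeping its interior text."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.out.append(f"<{tag}>")   # keep tag, drop all attributes

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)             # interior text is always kept

def clean(html: str) -> str:
    f = WhitelistFilter()
    f.feed(html)
    f.close()
    return "".join(f.out)

print(clean('<p><span style="mso-bidi"><b>Hi</b> there</span></p>'))
# -> <p><b>Hi</b> there</p>
```

Like the single-pass processing described above, this streams through the input once; unlike TagFilter, it cannot rewrite attributes or closing-tag content, so treat it only as a sketch of the keep/strip idea.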

The only downside is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects elements based on a DOM-like model. But that's only a drawback if the document being processed has things like ids and classes to select on; most Word/LibreOffice HTML content does not, which makes it a giant blob of unrecognizable, unparseable HTML as far as DOM processing tools are concerned.

answered Oct 01 '22 by CubicleSoft