
How to clean up a Microsoft Word HTML document?

I have quite a big document in HTML format that was generated by Microsoft Word. It is very messy and full of bloat (unknown tags, unknown namespaces, and so on).

Is there any way to convert it into plain HTML syntax?

nightingale2k1 asked Jun 28 '09 07:06


People also ask

How do I clean up HTML in Word?

Using MS Word's built-in "save as HTML" option: go to the File menu, select Save As, choose "Web Page, Filtered" in the file type drop-down box, and click Save.

How do I get rid of HTML formatting in Word?

In Word: On the Edit menu, click Clear and then select Clear Formatting.


4 Answers

Try HTML Tidy. I hear it works quite well on HTML generated by MS Word (definitely at least up to Word 2000, but probably on more recent versions too).

David Z answered Oct 19 '22 03:10


This isn't really a programming question, but (at least recent versions of) Word can save to "Web Page, Filtered", which removes Office-specific tags and properties and only leaves the tags necessary for the document to be rendered in a web browser. So, if you have Word, you could try using it to open the HTML document and save it in that format.
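If Word isn't available, the same idea (dropping Office-specific tags and properties) can be roughly approximated in a script. The following is only a minimal sketch using Python's standard library: the function name `strip_word_markup` and the regex patterns are my own assumptions about typical Word residue (conditional comments, namespaced tags such as `<o:p>`, `Mso*` classes, `mso-` style properties), not a description of what Word's filter actually does. It only handles double-quoted `style` attributes, and a real HTML parser would be more robust.

```python
import re

# Illustrative patterns for common Word residue; not an exhaustive list.
CONDITIONAL_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
NAMESPACED_TAG = re.compile(r"</?[a-zA-Z]+:[^>]*>")  # e.g. <o:p>, </w:View>
MSO_CLASS = re.compile(r'\s+class="?Mso[A-Za-z0-9]*"?', re.IGNORECASE)
STYLE_ATTR = re.compile(r'\s+style="([^"]*)"', re.IGNORECASE)


def _drop_mso_properties(match):
    """Keep only the CSS declarations that do not start with 'mso-'."""
    declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
    kept = [d for d in declarations if not d.lower().startswith("mso-")]
    # Remove the whole style attribute when nothing is left in it.
    return ' style="{}"'.format("; ".join(kept)) if kept else ""


def strip_word_markup(html):
    """Remove typical Office-specific markup from an HTML string (hypothetical helper)."""
    html = CONDITIONAL_COMMENT.sub("", html)   # <!--[if gte mso 9]> ... <![endif]-->
    html = NAMESPACED_TAG.sub("", html)        # tags with an XML namespace prefix
    html = MSO_CLASS.sub("", html)             # class=MsoNormal and friends
    html = STYLE_ATTR.sub(_drop_mso_properties, html)
    return html
```

For example, `strip_word_markup('<p class=MsoNormal>Hi<o:p></o:p></p>')` leaves just the paragraph tag and its text. Treat the result as a first pass and check it in a browser.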

Vladimir Panteleev answered Oct 19 '22 02:10


Try the Cleanup HTML online tool to clean up Word HTML.

robeid answered Oct 19 '22 02:10


You're probably looking for HTML Tidy, which has adapters in pretty much every language out there. It has options to clean up Microsoft Word HTML output (and many other features).
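As a rough illustration of those options, here is a sketch that calls the `tidy` command-line tool from Python instead of using a language binding. It assumes a `tidy` executable is installed and on your PATH. The helper names `build_tidy_command` and `clean_with_tidy` are made up for this example, and the particular combination of options (`-clean`, `-bare`, `word-2000`, `drop-proprietary-attributes`) is just one plausible starting point, so check the option list for your Tidy version.

```python
import shutil
import subprocess


def build_tidy_command(input_path, output_path):
    """Return an argv list for the tidy CLI with Word-oriented cleanup options."""
    return [
        "tidy",
        "-quiet",                                # suppress nonessential output
        "-clean",                                # replace presentational tags with CSS
        "-bare",                                 # strip Microsoft-specific HTML
        "--word-2000", "yes",                    # strip surplus Word 2000 markup
        "--drop-proprietary-attributes", "yes",  # drop attributes outside the HTML spec
        "-output", output_path,
        input_path,
    ]


def clean_with_tidy(input_path, output_path):
    """Run tidy on input_path and write the cleaned HTML to output_path."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    # tidy exits with 1 on warnings and 2 on errors, so check=True would be too strict.
    result = subprocess.run(
        build_tidy_command(input_path, output_path),
        capture_output=True,
        text=True,
    )
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return output_path
```

Usage would be something like `clean_with_tidy("word-export.html", "clean.html")`, where both file names are placeholders.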

cletus answered Oct 19 '22 04:10