I have quite a big document in HTML format that was generated from Microsoft Word. It is very messy and full of bloat (unknown tags, unknown namespaces, and so on).
Is there any way to convert it into plain HTML syntax?
Use MS Word's built-in save as HTML option: go to the File menu, select Save As, choose "Web Page, Filtered" in the file type drop-down box, and click Save.
In Word: On the Edit menu, click Clear and then select Clear Formatting.
Try HTML Tidy. I hear it works quite well on HTML generated by MS Word (definitely at least up to Word 2000, but probably on more recent versions too).
This isn't really a programming question, but (at least recent versions of) Word can save to "Web Page, Filtered", which removes Office-specific tags and properties and only leaves the tags necessary for the document to be rendered in a web browser. So, if you have Word, you could try using it to open the HTML document and save it in that format.
Try the Cleanup HTML online tool to clean up Word HTML.
You're probably looking for HTML Tidy, which has adapters in pretty much every language out there. It has options to clean up Microsoft Word HTML output (and many other features).
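For what it's worth, here is a rough sketch of driving the HTML Tidy command-line tool from Python. It assumes a `tidy` executable is installed and on your PATH. The function names and file names are made up for illustration, and which of these options is available may depend on your Tidy version, so check `tidy -help-config` first.

```python
import shutil
import subprocess
import sys


def build_tidy_command(src, dst):
    """Return the argument list for one HTML Tidy run over a Word export."""
    return [
        "tidy",
        "-q",                                    # suppress non-essential output
        "-o", dst,                               # write the cleaned markup here
        "--word-2000", "yes",                    # strip surplus Word 2000 markup
        "--bare", "yes",                         # strip Microsoft-specific HTML
        "--clean", "yes",                        # replace presentational tags with CSS
        "--drop-proprietary-attributes", "yes",  # remove vendor-specific attributes
        src,
    ]


def clean_word_html(src, dst):
    """Run Tidy on src and write the result to dst (hypothetical helper)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("HTML Tidy was not found on PATH")
    result = subprocess.run(
        build_tidy_command(src, dst), capture_output=True, text=True
    )
    # Tidy exits with 0 when clean, 1 when there were warnings, 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) == 3:
    clean_word_html(sys.argv[1], sys.argv[2])
```

Saved as, say, `clean_word.py`, it would be run as `python clean_word.py word_export.html cleaned.html`. Warnings (exit code 1) are treated as success here because Word output almost always triggers some.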