This might have been asked in another way. I am not doing it on the fly, however. Once in a while we get pieces of content in Word files that contain em dashes, bold and italic text, and block quotes. Is there a good tool to convert this into clean HTML?
Otherwise, what other approaches do people take?
A long time ago I was tasked with taking a reasonably well-structured, multi-megabyte Word document and converting it into a series of HTML pages (about 20,000 of them!). This was accomplished by saving the Word doc as RTF (Word's Save As HTML output was much too "dirty") and converting the RTF to HTML via a Perl script. The conversion was a two-pass process: first clean up common formatting errors, then convert the cleaned RTF to HTML.
Since the document editors continued to maintain the Word document, it paid to codify common formatting errors in the first pass, because the errors often recurred even after being fixed.
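To make the two-pass idea concrete, here is a minimal sketch in Python (the original was a Perl script, which I no longer have to hand, so this is not that code). It assumes a tiny, header-free subset of RTF with only the control words `\b`, `\b0`, `\i`, `\i0`, `\emdash` and `\par`. The cleanup rules and the function names `clean_rtf` and `rtf_to_html` are made up for illustration. Real RTF also has groups, font tables and character escapes, so a real converter needs a proper parser.

```python
import html
import re

# Pass 1: hypothetical cleanup rules for mistakes that editors keep reintroducing.
CLEANUP_RULES = [
    (re.compile(r"\\b\s*\\b0\s?"), ""),      # bold switched on and straight off again
    (re.compile(r"\\i\s*\\i0\s?"), ""),      # same for italics
    (re.compile(r" ?-- ?"), r"\\emdash "),   # typed double hyphen -> em dash control word
]

# Pass 2: the only control words this sketch understands.
TAGS = {
    ("b", ""): "<strong>",
    ("b", "0"): "</strong>",
    ("i", ""): "<em>",
    ("i", "0"): "</em>",
    ("emdash", ""): "&mdash;",
}

# A control word, an optional numeric parameter, and the single delimiter
# space that RTF swallows after a control word.
CONTROL = re.compile(r"\\([a-z]+)(\d*) ?")


def clean_rtf(rtf: str) -> str:
    """Pass 1: apply every cleanup rule in order."""
    for pattern, replacement in CLEANUP_RULES:
        rtf = pattern.sub(replacement, rtf)
    return rtf


def rtf_to_html(rtf: str) -> str:
    """Pass 2: turn the cleaned subset into <p> paragraphs."""
    paragraphs, current, pos = [], [], 0
    for match in CONTROL.finditer(rtf):
        current.append(html.escape(rtf[pos:match.start()]))
        pos = match.end()
        key = (match.group(1), match.group(2))
        if key == ("par", ""):
            paragraphs.append("".join(current).strip())
            current = []
        else:
            current.append(TAGS.get(key, ""))  # unknown control words are dropped
    current.append(html.escape(rtf[pos:]))
    paragraphs.append("".join(current).strip())
    return "\n".join(f"<p>{p}</p>" for p in paragraphs if p)


sample = r"A \b bold\b0  claim -- with \i emphasis\i0 .\par \b \b0 Second paragraph."
print(rtf_to_html(clean_rtf(sample)))
```

Keeping the cleanup rules in a plain list is what made the approach pay off: each time the editors reintroduced a formatting quirk, it was one more rule in the first pass and the conversion pass stayed untouched.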
Incidentally, this process showed a very skeptical management how, in just 40 hours (or so), a good coder could produce ~20,000 web pages and keep them up to date indefinitely, while the original authors (whose time was even more valuable) would have spent multiple hundreds of hours doing the conversion and would have been forced to maintain the resulting HTML by hand thereafter.