
Convert from Word document to HTML

Tags: html, c#, ms-word

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?

Pankaj asked Feb 15 '10 13:02


4 Answers

To convert a .docx file to HTML, you can use OpenXmlPowerTools. Make sure to add a reference to OpenXmlPowerTools.dll.

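using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging; // WordprocessingDocument lives in this namespace
using OpenXmlPowerTools;

// DocxFilePath and HTMLFilePath are the input and output paths.
byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     // Work on an in-memory copy so the file on disk is left untouched.
     memoryStream.Write(byteArray, 0, byteArray.Length);
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          // The converter returns the HTML document as an XElement tree.
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}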

Krantisinh Patil answered Oct 16 '22 01:10


You can try Microsoft.Office.Interop.Word. Note that this approach automates Word itself, so Word must be installed on the machine.

    using System;
    using Word = Microsoft.Office.Interop.Word;

    public static void ConvertDocToHtml(object Sourcepath, object TargetPath)
    {
        object Unknown = Type.Missing;
        Word._Application newApp = new Word.Application();
        try
        {
            Word.Documents d = newApp.Documents;
            Word.Document od = d.Open(ref Sourcepath, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown, ref Unknown);
            object format = Word.WdSaveFormat.wdFormatHTML;

            // Save the document that was just opened instead of relying on ActiveDocument.
            od.SaveAs(ref TargetPath, ref format,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown);

            object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;
            d.Close(ref doNotSave, ref Unknown, ref Unknown);
        }
        finally
        {
            // Quit Word so that no hidden WINWORD.EXE process is left running.
            newApp.Quit(ref Unknown, ref Unknown, ref Unknown);
        }
    }

Bimzee answered Oct 16 '22 00:10


I wrote Mammoth for .NET, which is a library that converts docx files to HTML, and is available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
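
As a rough illustration of the above, here is a minimal sketch of calling Mammoth from C#, assuming the Mammoth NuGet package is referenced. The class name MammothExample and the method name ConvertDocxToHtml are made up for this example; DocumentConverter, ConvertToHtml, Value and Warnings come from the library.

```csharp
using System;
using Mammoth;

public static class MammothExample
{
    // Converts the .docx file at docxPath to an HTML fragment and returns it.
    // Conversion warnings (for example, unrecognised styles) go to stderr.
    public static string ConvertDocxToHtml(string docxPath)
    {
        var converter = new DocumentConverter();
        var result = converter.ConvertToHtml(docxPath);

        foreach (string warning in result.Warnings)
        {
            Console.Error.WriteLine(warning);
        }

        return result.Value;
    }
}
```

The returned string is an HTML fragment rather than a full page, so you would typically wrap it in your own html and body elements before writing it to disk.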

Michael Williamson answered Oct 16 '22 01:10


I think this will depend on the version of the Word document. If you have them in docx format, I believe they are stored within the file as XML data (but it is so long since I looked at the specification I am perfectly happy to be corrected on that).
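
To make that concrete: a .docx file is a ZIP archive whose parts are XML, whereas the older binary .doc format is not XML at all. The sketch below uses only the .NET base class library to open the archive and read paragraph text from the main document part. It assumes the main part sits at the conventional path word/document.xml (strictly, the package relationships decide where it lives), and the DocxInspector class and its method names are invented for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxInspector
{
    // WordprocessingML namespace used by elements such as w:p and w:t.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Loads the main document part from a .docx stream.
    // Assumption: the part is stored at the conventional path word/document.xml.
    public static XDocument LoadMainDocumentPart(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("No word/document.xml part found; is this a .docx file?");
            }

            using (Stream part = entry.Open())
            {
                return XDocument.Load(part);
            }
        }
    }

    // Returns the plain text of each paragraph (w:p) by joining its text runs (w:t).
    public static List<string> ReadParagraphs(XDocument document)
    {
        return document.Descendants(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
            .ToList();
    }
}
```

This only shows that the content is reachable as XML. Turning it into faithful HTML (styles, lists, tables, images) is the part that libraries such as OpenXmlPowerTools or Mammoth take care of.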

ZombieSheep answered Oct 16 '22 01:10