How do I convert a .docx to html using asp.net?

Word 2007 saves its documents in .docx format, which is really a zip file containing a number of parts, including an XML file with the document body (word/document.xml).
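To make the zip structure concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea should carry over to .NET). It opens a .docx as a zip archive and reads the main document part. The function name `read_document_xml` is hypothetical, and the sketch assumes the part is UTF-8 encoded and stored at `word/document.xml`, which is where Word normally puts it.

```python
import zipfile


def read_document_xml(docx_file):
    """Return the raw XML of the main document part as a string.

    `docx_file` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        # Assumption: the body lives in word/document.xml and is UTF-8.
        return archive.read("word/document.xml").decode("utf-8")
```

Calling `read_document_xml("report.docx")` (a hypothetical file name) would return the XML text that the rest of the conversion has to work with.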

I want to be able to take a .docx file and drop it into a folder in my asp.net web app and have the code open the .docx file and render the (xml part of the) document as a web page.

I've been searching the web for more information on this but so far haven't found much. My questions are:

  1. Would you (a) use XSLT to transform the XML to HTML, or (b) use xml manipulation libraries in .net (such as XDocument and XElement in 3.5) to convert to HTML or (c) other?
  2. Do you know of any open source libraries/projects that have done this that I could use as a starting point?
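For reference, option (b) might look roughly like the following sketch. It is written in Python with the standard library rather than XDocument, purely as an illustration of the approach, and it handles only plain paragraph text (no formatting, tables, or images). The function name `paragraphs_to_html` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_to_html(document_xml):
    """Convert each w:p paragraph into an HTML <p> element (text only)."""
    root = ET.fromstring(document_xml)
    parts = []
    for para in root.iter(W + "p"):
        # Concatenate the text of every w:t element inside the paragraph.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        parts.append("<p>" + html.escape(text) + "</p>")
    return "\n".join(parts)
```

An XSLT stylesheet (option a) would express the same mapping declaratively; the sketch above only shows that the tree-walking version can stay small for the simplest case.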

Thanks!

Guy asked Sep 10 '08 19:09


4 Answers

Try this post? I'm not sure, but it might be what you are looking for.

Adam Lerman answered Nov 02 '22 05:11


I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
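As a rough illustration of the style-mapping idea (this is not Mammoth's actual code or API), the sketch below picks an HTML tag from a paragraph's Word style. It is in Python with the standard library only. The `STYLE_TO_TAG` table and the function name `paragraph_to_html` are hypothetical, and the style IDs `Heading1` and `Heading2` are assumed to be the IDs Word assigns to its built-in heading styles.

```python
import html
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical style map: Word paragraph style ID -> HTML tag.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2"}


def paragraph_to_html(para):
    """Render one w:p element, choosing the tag from its paragraph style."""
    style = para.find(W + "pPr/" + W + "pStyle")
    style_id = style.get(W + "val") if style is not None else None
    tag = STYLE_TO_TAG.get(style_id, "p")  # unknown or missing styles become <p>
    text = "".join(t.text or "" for t in para.iter(W + "t"))
    return "<{0}>{1}</{0}>".format(tag, html.escape(text))
```

Because the tag comes from the style rather than from font size or weight, a document that uses real heading styles converts cleanly, while one formatted by hand falls back to plain paragraphs.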

Michael Williamson answered Nov 02 '22 05:11


Word 2007 has an API that you can use to convert to HTML. Here's a post that talks about it: http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find documentation on the API; as far as I remember, it includes a convert-to-HTML function.

Vaibhav answered Nov 02 '22 04:11


This PHP code helps convert a .docx file to plain text:

function read_file_docx($filename){

    // Bail out early if no file name was given or the file does not exist.
    if (!$filename || !file_exists($filename)) return false;

    // A .docx file is a zip archive; the body text lives in word/document.xml.
    // ZipArchive is used because zip_open()/zip_read() are deprecated as of PHP 8.0.
    $zip = new ZipArchive();

    if ($zip->open($filename) !== true) return false;

    $content = $zip->getFromName('word/document.xml');

    $zip->close();

    if ($content === false) return false;

    // Separate table cells with a space and end each paragraph with a line break.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    // Drop the remaining XML markup, leaving only the text.
    $striped_content = strip_tags($content);

    // Keep only ASCII letters, digits, whitespace, and a few punctuation marks.
    $striped_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$striped_content);

    echo nl2br($striped_content);
}

raghava answered Nov 02 '22 04:11