
From HTML <figure> and <figcaption> to Microsoft Word

I have an HTML file with figure, img and figcaption tags, and I would like to convert it to a Microsoft Word document.

The image referenced by img should be inserted into the Word document, and the figcaption should become its caption (keeping the figure number).

I tried opening the HTML file with Word 2013, but the figcaption is not converted to a figure caption; it is just plain text below the image.

Is there a minimal working sample that does this? I had a look at https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Word_XML_Format_example but it is too verbose to serve as a simple "Hello world" sample.

figure .image {
    width: 100%;
}

figure {
    text-align: center;
    display: table;
    max-width: 30%; /* demo; set some amount (px or %) if you can */
    margin: 10px auto; /* not needed unless you want centered */
}
article {
  counter-reset: figures;
}

figure {
  counter-increment: figures;
}

figcaption:before {
  content: "Fig. " counter(figures) " - "; /* For I18n support; use data-counter-string. */
}

<figure>
<p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/c/ca/Matterhorn002.jpg"></p>
<figcaption>Il monte Cervino.</figcaption>
</figure>

<figure>
<p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/2/26/Banner_clouds.jpg"></p>
<figcaption>La nuvola che spesso è vicino alla vetta.</figcaption>
</figure>

I tried pandoc on Windows:

pandoc -f html -t docx -o hello.docx hello.html

but with no luck: the "Fig. 1" and "Fig. 2" prefixes are missing from the captions in the resulting document.

My pandoc is:

c:\temp>.\pandoc.exe -v
pandoc.exe 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4
Default user data directory: C:\Users\ale\AppData\Roaming\pandoc
Copyright (C) 2006-2016 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Edit 1

Using some C# would also be fine. Maybe I could transform the HTML into one of the Word XML formats with a C# program.
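To make the target of such a transformation concrete: as far as I understand, Word represents a numbered caption as a paragraph with the Caption style that contains a SEQ field. Below is a minimal sketch that generates that fragment. It is written in JavaScript (Node.js) only to keep it short and runnable; the function names captionParagraph and escapeXml are hypothetical, and the fragment assumes it is placed in the body of a WordprocessingML document where the w: namespace prefix is declared and a style with the id Caption exists.

```javascript
// Escape the characters that are special in XML text content.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build one WordprocessingML paragraph of the form "Figure <SEQ field> - <text>".
// cachedNumber is only the value displayed until Word updates the field.
// The default label "Figure" is an assumption; it is also used as the SEQ identifier.
function captionParagraph(text, cachedNumber, label = "Figure") {
  return [
    "<w:p>",
    '  <w:pPr><w:pStyle w:val="Caption"/></w:pPr>',
    `  <w:r><w:t xml:space="preserve">${label} </w:t></w:r>`,
    `  <w:fldSimple w:instr=" SEQ ${label} \\* ARABIC ">`,
    `    <w:r><w:t>${cachedNumber}</w:t></w:r>`,
    "  </w:fldSimple>",
    `  <w:r><w:t xml:space="preserve"> - ${escapeXml(text)}</w:t></w:r>`,
    "</w:p>",
  ].join("\n");
}

console.log(captionParagraph("Il monte Cervino.", 1));
```

This covers only the caption paragraph; the image itself and the rest of the document packaging are not shown.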

Alessandro Jacopson asked Jul 11 '17 08:07


2 Answers

This may be more roundabout than you would like, but you can save the file as a PDF and then export that PDF to Word. (I went into Adobe and created a PDF from an HTML file containing figure/figcaption, but you could obviously do that programmatically.) It is perhaps one intermediate step too many, but it does work!

Hope this is of some assistance (perhaps a PDF would do?).

EDIT 1: I just found a jQuery plugin by Mark Windsoll which converts HTML to Word. I made a CodePen that includes figure/figcaption here. When you press the button, it exports the content as a Word document. (I suppose you could save it too, but his original CodePen didn't actually do anything when you clicked the link that said "export to doc"... sigh.)

// Export the #page-content element as a Word document when the button is clicked.
jQuery(document).ready(function ($) {
    $(".word-export").click(function () {
        $("#page-content").wordExport();
    });
});

img { width: 300px; height: auto; }
figcaption { width: 350px; text-align: center; }
h1 { margin-top: 10px; }
h1, h2 { margin-left: 35px; }
p { width: 95%; padding-top: 20px; margin: 0 auto; }
button { margin: 15px 30px; padding: 5px; }

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/FileSaver.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/jquery.wordexport.js"></script>

<link href="https://www.jqueryscript.net/css/jquerysctipttop.css" rel="stylesheet"/>

<h1>jQuery Word Export Plugin Demo</h1>
<div id="page-content">
<h2>Lovely Trees</h2>
<figure>
  <img src="http://www.rachelgallen.com/images/autumntrees.jpg">
  <figcaption>Autumn Trees</figcaption>
</figure>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec vehicula bibendum lacinia. Pellentesque placerat interdum nisl non semper. Integer ornare, nunc non varius mattis, nulla neque venenatis nibh, vitae cursus risus quam ut nulla. Aliquam erat volutpat. Aliquam erat volutpat. </p>
  <p>And some more text here, but that's quite enough lorem ipsum rubbish!</p>
</div>
<button class="word-export">Export as .doc</button>

EDIT 2: To convert HTML to Word using C# you can use GemBox.Document, which is free unless you buy the professional version (you can use it for free for a while to evaluate it).

The C# code is:

using GemBox.Document;

ComponentInfo.SetLicense("FREE-LIMITED-KEY"); // free-mode key; set before any other call
// Convert HTML to Word (DOCX) document.
DocumentModel.Load("Document.html").Save("Document.docx");

Rachel Gallen answered Oct 16 '22 17:10


I have never used pandoc; I guess it does not support many advanced CSS3 features.

1. Using Aspose.Words

I copied your CSS and HTML into a file named figure.htm and converted it with Aspose.Words; the result is what you were hoping for.

The C# code is as follows:

using Aspose.Words;

// Start from an empty document and insert the HTML at the builder's cursor.
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
string html = System.IO.File.ReadAllText("./figure.htm");
builder.InsertHtml(html);

// The output format is chosen from the file extension (.doc here).
doc.Save("d:\\DocumentBuilder.InsertTableFromHtml Out.doc");

My Aspose.Words version is 16.7.0.0.

2. Format the figcaption tags

Another way is to keep using pandoc but fix up the HTML file before converting it. The root of your problem is that pandoc does not handle many advanced CSS3 features, so the figure numbers generated by your CSS counter are lost; if you write the numbers into the HTML yourself, pandoc works well too.

Here is some test code using System.Text.RegularExpressions. Running it creates a new HTML file, figure1.htm, in which the inner HTML of every figcaption is rewritten to a fixed format:

using System.Text;
using System.Text.RegularExpressions;

// Matches <tag>content</tag> on a single line; \k<tag> requires the same closing tag.
Regex regex = new Regex("<(?<tag>[a-zA-Z]+?)>(?<html>.+)</\\k<tag>>", RegexOptions.Compiled);
string html = System.IO.File.ReadAllText("./figure.htm", Encoding.UTF8);
int i = 1; // running figure number

string newHtml = regex.Replace(html, m =>
{
    string tag = m.Groups["tag"].Value;
    string text = m.Groups["html"].Value;
    if (tag.ToLower() == "figcaption")
    {
        // Prepend the number that the CSS counter would have generated.
        return $"<{tag}>Fig. {i++} - {text}</{tag}>";
    }
    return m.Value; // leave every other element untouched
});

System.IO.File.WriteAllText("./figure1.htm", newHtml, Encoding.UTF8);
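The same pre-processing step can be sketched without .NET. Below is a minimal JavaScript (Node.js) version, as an illustration only: the function name numberFigcaptions and its label parameter are hypothetical names of mine. It assumes that numbering captions across the whole file is acceptable (the CSS in the question resets the counter per article), and unlike the single-line regex above it also accepts captions that carry attributes or span several lines.

```javascript
// Prefix each <figcaption> with "Fig. N - ", mimicking the CSS counter
// from the question so that the number survives as literal text.
function numberFigcaptions(html, label = "Fig.") {
  let n = 0; // running figure number, counted across the whole input
  // \b stops "<figcaptionfoo>" from matching; [\s\S]*? lets a caption span lines.
  return html.replace(
    /<figcaption(\b[^>]*)>([\s\S]*?)<\/figcaption>/gi,
    (match, attrs, text) =>
      `<figcaption${attrs}>${label} ${++n} - ${text.trim()}</figcaption>`
  );
}

// Small in-memory sample standing in for the contents of figure.htm.
const input = [
  '<figure><img src="a.jpg"><figcaption>Il monte Cervino.</figcaption></figure>',
  '<figure><img src="b.jpg"><figcaption>La nuvola.</figcaption></figure>',
].join("\n");

const output = numberFigcaptions(input);
console.log(output);
```

The resulting string can be written to figure1.htm and passed to the pandoc command from the question. Note that this yields plain text inside the caption, not a Word caption field.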

I hope this helps!

Johan Shen answered Oct 16 '22 17:10