
How to parse MathML in the output of WordOpenXML?

I want to read only the XML used to generate an equation, which I obtained using Paragraph.Range.WordOpenXML. However, the section for the equation is not MathML, even though I had read that Microsoft's equation format is MathML.

Do I need a special converter to get the desired XML, or are there other methods?

asked May 26 '13 12:05 by serene


1 Answer

You can use the OMML2MML.XSL file (located under %ProgramFiles%\Microsoft Office\Office15 for Office 2013; the folder name changes with the installed Office version) to transform the Office MathML (OMML) equations in a Word document into MathML.

The code below transforms the equations in a Word document into MathML using the following steps:

  1. Open the Word document using the Open XML SDK (version 2.5).
  2. Create an XslCompiledTransform and load the OMML2MML.XSL file.
  3. Transform the Word document by calling the Transform() method on the created XslCompiledTransform instance.
  4. Output the result of the transform (e.g. print it to the console or write it to a file).

I've tested the code below with a simple Word document containing two equations, text and pictures.

using System.IO;
using System.Text;   // Encoding.UTF8
using System.Xml;
using System.Xml.Xsl;
using DocumentFormat.OpenXml.Packaging;

public string GetWordDocumentAsMathML(string docFilePath, string officeVersion = "14")
{
    string officeML = string.Empty;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(docFilePath, false))
    {
        string wordDocXml = doc.MainDocumentPart.Document.OuterXml;

        XslCompiledTransform xslTransform = new XslCompiledTransform();

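        // The OMML2MML.XSL file ships with Office. The folder name carries
        // the version number, e.g. %ProgramFiles%\Microsoft Office\Office14\
        // for the default officeVersion "14" (Office15 for Office 2013).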
        xslTransform.Load(@"c:\Program Files\Microsoft Office\Office" + officeVersion + @"\OMML2MML.XSL");

        using (TextReader tr = new StringReader(wordDocXml))
        {
            // Load the xml of your main document part.
            using (XmlReader reader = XmlReader.Create(tr))
            {
                using (MemoryStream ms = new MemoryStream())
                {
                    XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                    // Configure xml writer to omit xml declaration.
                    settings.ConformanceLevel = ConformanceLevel.Fragment;
                    settings.OmitXmlDeclaration = true;

                    XmlWriter xw = XmlWriter.Create(ms, settings);

                    // Transform our OfficeMathML to MathML.
                    xslTransform.Transform(reader, xw);

                    // Flush any buffered output before reading the stream back.
                    xw.Flush();
                    ms.Seek(0, SeekOrigin.Begin);

                    using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
                    {
                        officeML = sr.ReadToEnd();
                        // Console.Out.WriteLine(officeML);
                    }
                }
            }
        }
    }
    return officeML;
}

To convert a single equation instead of the whole Word document, query for the desired Office Math paragraph (m:oMathPara) and use the OuterXml property of that node. The code below queries for the first math paragraph (First() requires using System.Linq):

string mathParagraphXml = 
      doc.MainDocumentPart.Document.Descendants<DocumentFormat.OpenXml.Math.Paragraph>().First().OuterXml;

Pass the returned XML to the StringReader in place of wordDocXml.
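
As a rough sketch of that last step, the transform can be pulled out into a small helper that accepts any OMML fragment as a string. The class name OmmlConverter, the method name TransformFragment and the xslPath parameter are hypothetical names chosen for illustration; xslPath is assumed to point at your local copy of OMML2MML.XSL. The helper writes to a StringBuilder instead of a MemoryStream, which is a simplification of the code above rather than the tested original.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class OmmlConverter
{
    // Hypothetical helper (not part of the tested code above): applies the
    // stylesheet at xslPath to a single XML fragment, e.g. the OuterXml of
    // one m:oMathPara node, and returns the transformed markup.
    public static string TransformFragment(string ommlXml, string xslPath)
    {
        XslCompiledTransform xslTransform = new XslCompiledTransform();
        xslTransform.Load(xslPath);

        // Same writer configuration as above: fragment output, no XML declaration.
        XmlWriterSettings settings = xslTransform.OutputSettings.Clone();
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        settings.OmitXmlDeclaration = true;

        StringBuilder output = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(ommlXml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            xslTransform.Transform(reader, writer);
        }
        // The writer has been disposed here, so everything is flushed into output.
        return output.ToString();
    }
}
```

With the query above, a call would look like OmmlConverter.TransformFragment(mathParagraphXml, xslPath), where xslPath is built the same way as in GetWordDocumentAsMathML.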

answered Sep 21 '22 06:09 by Hans