
Extract part of an XML file as plain text using XSLT

Tags:

xml

xslt

Seems like this should be easy, but ...

I'm trying to use XSLT to extract part of an XML file as plain text, throwing away the rest.

So from sample input like this ...

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://segonku.unl.edu/teianalytics/TEIAnalytics.rng"
                        type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
   <teiHeader type="text">
      <fileDesc>
         <titleStmt>
            <title>Header Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Published</p>
         </publicationStmt>
         <sourceDesc>
            <p>Sourced</p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <front>
      </front>
      <body>
         <head>THE TITLE</head>
         <div type="chapter" part="N" org="uniform" sample="complete">
            <head>CHAPTER I</head>
            <p>Some text.</p>
         </div>
      </body>
   </text>
</TEI>

... I'm trying to get just the text contained within the <body> tags and all their children. The desired output in this case is:

THE TITLE
CHAPTER I
Some text.

Potential complication: <body> can also exist in the <front> matter and/or in the <teiHeader>, so what I really need is the children of <body> if and only if that tag is a child of <text> and of <TEI>.

I've tried really simple XSL like this ...

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

Thanks!

Matthew Wilkens asked Jun 17 '11 23:06



1 Answer

I've tried really simple XSL like this ...

...

     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
         <xsl:output method="text"/>
         <xsl:template match="/TEI/text/body">
             <xsl:apply-templates select="."/>
         </xsl:template>
     </xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

The reason is a well-known feature of XPath (and the cause of many thousands of similar questions): any unprefixed name is treated as belonging to "no namespace". However, every element in the provided XML document belongs to the namespace "http://www.tei-c.org/ns/1.0" and must be addressed as a node in that namespace. Because the match pattern /TEI/text/body therefore matches nothing, only the built-in templates run, and they copy every text node in the file to the output.

Solution: Declare the document's default namespace in the XSLT code (this time with a prefix bound to it) and use that prefix on every element name in XPath expressions and match patterns.

This is one of the simplest and shortest possible transformations that produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="x:text/x:body//text()">
  <xsl:value-of select="concat(.,'&#xA;')"/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When applied to the provided XML document:

<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
    <teiHeader type="text">
        <fileDesc>
            <titleStmt>
                <title>Header Title</title>
            </titleStmt>
            <publicationStmt>
                <p>Published</p>
            </publicationStmt>
            <sourceDesc>
                <p>Sourced</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <front>      </front>
        <body>
            <head>THE TITLE</head>
            <div type="chapter" part="N" org="uniform" sample="complete">
                <head>CHAPTER I</head>
                <p>Some text.</p>
            </div>
        </body>
    </text>
</TEI>

the wanted, correct result is produced:

THE TITLE
CHAPTER I
Some text.
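
If a <text> element can also turn up deeper in the document (with its own <body> child), the pattern x:text/x:body//text() would match that text as well. A stricter variant, offered only as a sketch under the same assumptions as above (XSLT 1.0, the prefix x bound to the TEI namespace), starts at the document root and selects only the <body> that is a child of <text>, which in turn is a child of the top-level <TEI>:

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <!-- Start at the document root and visit only the text nodes under
      /TEI/text/body; nothing else in the document is ever processed. -->
 <xsl:template match="/">
  <xsl:apply-templates select="x:TEI/x:text/x:body//text()"/>
 </xsl:template>

 <!-- Emit each selected text node followed by a newline. -->
 <xsl:template match="text()">
  <xsl:value-of select="concat(., '&#xA;')"/>
 </xsl:template>
</xsl:stylesheet>
```

Because the root template selects exactly the nodes it wants, no empty text() template is needed to suppress the rest. On the sample document above this should give the same three lines of output.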

Dimitre Novatchev answered Oct 24 '22 00:10