
Converting XML to plain text - how should I ignore/handle whitespace in the XSLT?

I'm trying to convert an XML file into the markup used by dokuwiki, using XSLT. This actually works to some degree, but the indentation in the XSL file is getting inserted into the results. At the moment, I have two choices: abandon this XSLT thing entirely, and find another way to convert from XML to dokuwiki markup, or delete about 95% of the whitespace from the XSL file, making it nigh-unreadable and a maintenance nightmare.

Is there some way to keep the indentation in the XSL file without passing all that whitespace on to the final document?

Background: I'm migrating an autodoc tool from static HTML pages over to dokuwiki, so the API developed by the server team can be further documented by the applications team whenever the apps team runs into poorly-documented code. The logic is to have a section of each page set aside for the autodoc tool, and to allow comments anywhere outside this block. I'm using XSLT because we already have the XSL file to convert from XML to XHTML, and I'm assuming it will be faster to rewrite the XSL than to roll my own solution from scratch.

Edit: Ah, right, foolish me, I neglected the indent attribute. (Other background note: I am new to XSLT.) On the other hand, I still have to deal with newlines. Dokuwiki uses pipes to differentiate between table columns, which means that all of the data in a table row must be on one line. Is there a way to suppress newlines in the output (just occasionally), so I can do some fairly complex logic for each table cell in a somewhat readable fashion?

PotatoEngineer asked Oct 08 '08 19:10



2 Answers

There are three reasons for getting unwanted whitespace in the result of an XSLT transformation:

  1. whitespace that comes from between nodes in the source document
  2. whitespace that comes from within nodes in the source document
  3. whitespace that comes from the stylesheet

I'm going to talk about all three because it can be hard to tell where whitespace comes from, so you might need to use several strategies.

To address the whitespace that is between nodes in your source document, you should use <xsl:strip-space> to strip out any whitespace that appears between two nodes, and then use <xsl:preserve-space> to preserve the significant whitespace that might appear within mixed content. For example, if your source document looks like:

<ul>
  <li>This is an <strong>important</strong> <em>point</em></li>
</ul>

then you will want to ignore the whitespace between the <ul> and the <li> and between the </li> and the </ul>, which is not significant, but preserve the whitespace between the <strong> and <em> elements, which is significant (otherwise you'd get "This is an **important***point*"). To do this use

<xsl:strip-space elements="*" />
<xsl:preserve-space elements="li" />

The elements attribute on <xsl:preserve-space> should basically list all the elements in your document that have mixed content.

Aside: using <xsl:strip-space> also reduces the size of the source tree in memory, and makes your stylesheet more efficient, so it's worth doing even if you don't have whitespace problems of this sort.
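As a rough sketch of where those two declarations live, here is a minimal complete stylesheet for the <ul> example. Both are top-level elements, i.e. direct children of <xsl:stylesheet>. The templates for li, strong and em are my own illustration (not from the question's stylesheet) and assume dokuwiki's usual markup: two spaces and an asterisk for a bullet, ** for bold, // for italic.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <!-- Drop whitespace-only text nodes everywhere in the source document, -->
  <xsl:strip-space elements="*" />
  <!-- except directly inside elements that hold mixed content. -->
  <xsl:preserve-space elements="li" />

  <!-- Illustrative template: one dokuwiki bullet line per list item. -->
  <xsl:template match="li">
    <xsl:text>  * </xsl:text>
    <xsl:apply-templates />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Illustrative template: bold in dokuwiki markup. -->
  <xsl:template match="strong">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates />
    <xsl:text>**</xsl:text>
  </xsl:template>

  <!-- Illustrative template: italic in dokuwiki markup. -->
  <xsl:template match="em">
    <xsl:text>//</xsl:text>
    <xsl:apply-templates />
    <xsl:text>//</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The space between </strong> and <em> survives because its parent is <li>, and the more specific name in <xsl:preserve-space> wins over the * in <xsl:strip-space>. The whitespace around <li>, whose parent is <ul>, is dropped.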

To address the whitespace that appears within nodes in your source document, you should use normalize-space(). For example, if you have:

<dt>
  a definition
</dt>

and you can be sure that the <dt> element won't hold any elements that you want to do something with, then you can do:

<xsl:template match="dt">
  ...
  <xsl:value-of select="normalize-space(.)" />
  ...
</xsl:template>

The leading and trailing whitespace will be stripped from the value of the <dt> element and you will just get the string "a definition".
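Since the goal here is one-line table cells, it may help to know that normalize-space() also collapses every internal run of whitespace, line breaks included, into a single space. A small sketch, assuming a hypothetical <dt> whose text is wrapped over several lines in the source:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!--
    Hypothetical input:
      <dt>
        a
        definition
      </dt>
    normalize-space(.) trims both ends and turns the line break and
    indentation between the two words into one space.
  -->
  <xsl:template match="dt">
    <xsl:value-of select="normalize-space(.)" />
    <!-- Explicit line break, so the newline is there on purpose. -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Because it works on the string value, any markup inside the element is flattened to plain text, which is why this only suits elements whose children you don't need to process.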

Whitespace coming from the stylesheet, which is perhaps what you're experiencing, appears when you have text within a template like this:

<xsl:template match="name">
  Name:
  <xsl:value-of select="." />
</xsl:template>

XSLT stylesheets are parsed in the same way as the source documents that they process, so the above XSLT is interpreted as a tree that holds an <xsl:template> element with a match attribute whose first child is a text node and whose second child is a <xsl:value-of> element with a select attribute. The text node has leading and trailing whitespace (including line breaks); since it's literal text in the stylesheet, it gets literally copied over into the result, with all the leading and trailing whitespace.

But some whitespace in XSLT stylesheets gets stripped automatically, namely text nodes that consist of nothing but whitespace and sit between nodes. For example, the line break between the <xsl:value-of> and the close of the <xsl:template> does not produce a line break in your result.

To get only the text you want in the result, use the <xsl:text> element like this:

<xsl:template match="name">
  <xsl:text>Name: </xsl:text>
  <xsl:value-of select="." />
</xsl:template>

The XSLT processor will ignore the line breaks and indentation that appear between nodes, and only output the text within the <xsl:text> element.
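Applied to the dokuwiki table problem, the same idea lets you keep per-cell logic indented and readable while still emitting each row on a single line. This is only a sketch: the param, name, type and desc elements are hypothetical stand-ins for whatever the autodoc XML really contains.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!--
    Hypothetical input:
      <param><name>id</name><type>int</type><desc>User id</desc></param>
  -->
  <xsl:template match="param">
    <xsl:text>| </xsl:text>
    <xsl:value-of select="normalize-space(name)" />
    <xsl:text> | </xsl:text>
    <xsl:value-of select="normalize-space(type)" />
    <xsl:text> | </xsl:text>
    <!-- Per-cell logic can span many lines without adding newlines. -->
    <xsl:choose>
      <xsl:when test="normalize-space(desc)">
        <xsl:value-of select="normalize-space(desc)" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>(undocumented)</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <!-- The only newline in the row is this explicit one. -->
    <xsl:text> |&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

All the indentation around <xsl:choose>, <xsl:when> and <xsl:otherwise> is whitespace-only text between nodes, so it is stripped from the stylesheet. The row ends only where the explicit &#10; says so.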

JeniT answered Sep 27 '22 20:09


Are you using indent="no" in your output tag?

<xsl:output method="text" indent="no" /> 

Also, if you're using xsl:value-of, you can set disable-output-escaping="yes", though that attribute governs how characters such as < and & are escaped rather than how whitespace is handled.

Lindsay answered Sep 27 '22 20:09