
Use XSLT to mark up text matching regex?

I am trying to use XSLT 2.0 (Saxon-PE 9.6) on an HTML document to wrap every contiguous run of characters from a specified non-Latin Unicode block (internal spaces allowed) in a tag. The process must apply to every text() node in the document. I have made some progress with two approaches, one using <xsl:analyze-string> and one using fn:replace(), but neither has given me a satisfactory and complete solution.

For example, here is some text containing Hindi:

Input: <p>चाय का कप means ‘cup of tea’ in हिन्दि.</p>

Desired Output: <p><span xml:lang="hi-Deva">चाय का कप</span> means ‘cup of tea’ in <span xml:lang="hi-Deva">हिन्दि</span>.</p>

How can this process be implemented in XSLT 2.0?


Here's my attempt with <xsl:analyze-string>:

(Note: the Hindi language uses the Devanagari code block U+0900 to U+097F.)

<xsl:template match="text()">
  <xsl:variable name="textValue" select="."/>

  <xsl:analyze-string select="$textValue" regex="(\s*.*?)([&#x0900;-&#x097f;]+)((\s+[&#x0900;-&#x097f;]+)*)(\s*.*)">

    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span xml:lang="hi-Deva"><xsl:value-of select="regex-group(2)"/><xsl:value-of select="regex-group(3)"/></span>
      <xsl:value-of select="regex-group(5)"/>
    </xsl:matching-substring>

    <xsl:non-matching-substring>
      <xsl:value-of select="$textValue"/>
    </xsl:non-matching-substring>

  </xsl:analyze-string>
</xsl:template>

On the test input, this produces:

Output: <p><span xml:lang="hi-Deva">चाय का कप</span> means ‘cup of tea’ in हिन्दि.</p>

This approach misses the second region of Hindi text (हिन्दि). I need an approach that finds and tags every occurrence matched by the regex.

My second approach used fn:replace():

<xsl:template match="text()">
  <xsl:value-of select='fn:replace(., "[&#x0900;-&#x097f;]+(\s+[&#x0900;-&#x097f;]+)*", "xxx$0xxx")'/>
</xsl:template>

On the test input, this produces:

Output: <p>xxxचाय का कपxxx means ‘cup of tea’ in xxxहिन्दिxxx.</p>

This is clearly incorrect, since the Hindi is wrapped in xxx markers rather than span tags. On the positive side, every region of Hindi is found and processed. I cannot simply substitute span tags for the xxx markers, because markup inside the replacement string is invalid XSLT.

keithm asked Jan 09 '23 22:01


2 Answers

I came up with http://xsltransform.net/jyH9rMo, which matches only the Devanagari runs and leaves all other text to xsl:non-matching-substring:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="html" doctype-public="XSLT-compat" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />

    <xsl:template match="/">
      <html>
        <head>
          <title>New Version!</title>
        </head>
        <xsl:apply-templates/>
      </html>
    </xsl:template>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="text()">
        <xsl:analyze-string select="." regex="([&#x0900;-&#x097f;]+)((\s+[&#x0900;-&#x097f;]+)*)">

            <xsl:matching-substring>
                <span xml:lang="hi-Deva"><xsl:value-of select="."/></span>
            </xsl:matching-substring>

            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>

        </xsl:analyze-string>
    </xsl:template>
</xsl:transform>
Martin Honnen answered Jan 15 '23 19:01


Complementing the previous answers, note that you can write \p{IsDevanagari} in place of [&#x0900;-&#x097f;].
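
As a rough sketch of that substitution, the text() template might look like the following. One detail to watch: the regex attribute of xsl:analyze-string is an attribute value template, so the curly braces have to be doubled there. The span markup and the hi-Deva language tag are carried over from the question. The priority="1" attribute is an addition of this sketch, meant to avoid an ambiguous match against node() in the identity template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Identity template: copy everything not handled by a more specific rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Wrap each run of Devanagari characters (internal whitespace allowed) in a span.
       Braces are doubled because regex is an attribute value template.
       priority="1" (an assumption of this sketch) makes this rule win over node() above. -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="\p{{IsDevanagari}}+(\s+\p{{IsDevanagari}}+)*">
      <xsl:matching-substring>
        <span xml:lang="hi-Deva"><xsl:value-of select="."/></span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Inside an XPath string literal, for example the pattern argument of fn:replace() or fn:matches() in a select attribute, the braces are written singly as \p{IsDevanagari}, since select is not an attribute value template.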

Michael Kay answered Jan 15 '23 21:01