
Replace substrings using a lookup table with XSLT

Tags: xslt, xslt-2.0

I have a few strings containing hexadecimal escape codes (the source is FrameMaker, in case it matters). A string could therefore look like

this is some sentence with some hex code\x27 s , and we need that fixed.

and will need to be changed to

this is some sentence with some hex code's , and we need that fixed.

In reality there can be several of these in a single string, so I'm looking for the best way to walk through the text, capture all hex codes (of the form \x##) and replace each one with the correct character. I have made an XML lookup table containing all the characters, as follows:

<xsl:param name="reflist">
    <Code Value="\x27">'</Code>
    <Code Value="\x28">(</Code>
    <Code Value="\x29">)</Code>
    <Code Value="\x2a">*</Code>
    <Code Value="\x2b">+</Code>
    <!-- many more like these... -->
</xsl:param>

For now I use simple replace() calls, but there are too many characters for that to be workable.
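
For illustration, a chain of replace() calls of this kind might look like the sketch below. The actual code is not shown here, so the structure and the three codes chosen are assumptions; the point is only that every additional code adds another level of nesting.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything that is not a text node unchanged. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- One nested replace() per code: three codes already need three levels. -->
  <xsl:template match="text()">
    <xsl:value-of select='replace(replace(replace(.,
        "\\x27", "&apos;"), "\\x28", "("), "\\x29", ")")'/>
  </xsl:template>
</xsl:stylesheet>
```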

What's the best way to do this?

Wokoman asked Feb 19 '23 13:02


1 Answer

One can avoid a "reference table" entirely by computing each character from its hex code, like this:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:my="my:my" exclude-result-prefixes="my xs">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

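 <!-- Text nodes containing "\x" plus hex digits: decode each escape;
      substring(., 3) drops the leading "\x" before conversion. -->
 <xsl:template match="text()[matches(.,  '\\x(\d|[a-f])+')]">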
   <xsl:analyze-string select="." regex="\\x(\d|[a-f])+" >
     <xsl:matching-substring>
       <xsl:value-of select=
       "codepoints-to-string(my:hex2dec(substring(.,3), 0))"/>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
      <xsl:value-of select="."/>
     </xsl:non-matching-substring>
   </xsl:analyze-string>
 </xsl:template>

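 <!-- Converts a string of lowercase hex digits to an integer, one digit per
      recursive call; callers pass 0 as the initial accumulator. -->
 <xsl:function name="my:hex2dec" as="xs:integer">
  <xsl:param name="pStr" as="xs:string"/>
  <xsl:param name="pAccum" as="xs:integer"/>

  <!-- Empty input: return the accumulator. Otherwise map the first character
       to 0..15 and recurse on the rest with 16 * accumulator + digit. -->
  <xsl:sequence select=
   "if(not($pStr))
     then $pAccum
     else
      for $char in substring($pStr, 1, 1),
          $code in
            if($char ge '0' and $char le '9')
              then xs:integer($char)
              else
                string-to-codepoints($char) - string-to-codepoints('a') +10
       return
          my:hex2dec(substring($pStr,2), 16*$pAccum + $code)
   "/>
 </xsl:function>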
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<t>
 <p>this is some sentence with some hex code\x27 s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x28 s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x29 s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2a s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2b s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2c s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2d s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2e s ,
    and we need that fixed.</p>
 <p>this is some sentence with some hex code\x2f s ,
    and we need that fixed.</p>
</t>

the wanted, correct result is produced:

<t>
   <p>this is some sentence with some hex code' s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code( s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code) s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code* s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code+ s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code, s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code- s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code. s ,
    and we need that fixed.</p>
   <p>this is some sentence with some hex code/ s ,
    and we need that fixed.</p>
</t>
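
One caveat with the regex above: because + is greedy, an escape that is directly followed by text starting with 0-9 or a-f (for example \x27abc) would be consumed as one longer code, and uppercase digits such as \x2A are not matched. If the escapes always have exactly two hex digits, as the \x## notation in the question suggests, a variant along these lines may be safer. It is only a sketch under that assumption; the function name my:hex and the substring-before digit lookup are illustrative choices, not part of the stylesheet above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="my:my" exclude-result-prefixes="my xs">
  <xsl:output omit-xml-declaration="yes"/>

  <!-- Identity rule: copy everything that is not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: every escape is "\x" plus exactly two hex digits.
       The regex attribute is an attribute value template, so the
       quantifier braces have to be doubled. -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="\\x([0-9a-fA-F]{{2}})">
      <xsl:matching-substring>
        <xsl:value-of
            select="codepoints-to-string(my:hex(regex-group(1)))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Hypothetical helper: case-insensitive hex string to integer.
       The value of a digit is its offset in '0123456789abcdef'. -->
  <xsl:function name="my:hex" as="xs:integer">
    <xsl:param name="pStr" as="xs:string"/>
    <xsl:variable name="len" select="string-length($pStr)"/>
    <xsl:sequence select="
      if ($len eq 0) then 0
      else 16 * my:hex(substring($pStr, 1, $len - 1))
           + string-length(substring-before('0123456789abcdef',
                           lower-case(substring($pStr, $len))))"/>
  </xsl:function>
</xsl:stylesheet>
```

With this variant, \x27abc is decoded as 'abc. The trade-off is that longer escapes such as \x0428 are no longer supported: only the leading \x04 would be matched, and since code point 4 is not a valid XML 1.0 character, codepoints-to-string would typically raise an error.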

Do note:

The transformation at the top of this answer is generic: it is not limited to the codes in a lookup table and handles any \x escape whose lowercase hex digits denote a valid XML character.

For example, if it is applied to this XML document:

<t>
 <p>this is some sentence with some hex code\x0428\x0438\x0448 s ,
    and we need that fixed.</p>
</t>

the correct result (containing the Bulgarian word for "grill" in Cyrillic) is produced:

<t>
   <p>this is some sentence with some hex codeШиш s ,
    and we need that fixed.</p>
</t>
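
If some codes in the source do not map to the Unicode code point with the same number, the lookup table from the question can still be used together with xsl:analyze-string. The following is a minimal sketch, assuming every escape is \x plus exactly two lowercase hex digits and that codes missing from the table should be left untouched; the table is abbreviated to three entries.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>

  <!-- Lookup table from the question (abbreviated). -->
  <xsl:param name="reflist">
    <Code Value="\x27">'</Code>
    <Code Value="\x28">(</Code>
    <Code Value="\x29">)</Code>
  </xsl:param>

  <!-- Identity rule: copy everything that is not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Braces are doubled because regex is an attribute value template. -->
    <xsl:analyze-string select="." regex="\\x[0-9a-f]{{2}}">
      <xsl:matching-substring>
        <xsl:variable name="code" select="."/>
        <!-- Use the table entry if there is one; otherwise keep the escape. -->
        <xsl:value-of
            select="($reflist/Code[@Value = $code]/string(.), $code)[1]"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Each text node is scanned once, however many entries the table has, so this avoids the nesting problem of chained replace() calls.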

Dimitre Novatchev answered Feb 27 '23 22:02