
group by multiple attributes from xml with xslt

Tags: xml, xslt, xslt-1.0

I have the following XML:

<smses>
  <sms address="87654321" type="1" body="Some text" readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
  <sms address="87654321" type="2" body="Some text" readable_date="3/09/2011 2:36:41 PM" contact_name="Person1" />
  <sms address="87654321" type="1" body="Some text" readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
  <sms address="123" type="2" body="Some text" readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
  <sms address="123" type="1" body="Some text" readable_date="3/09/2011 10:57:52 AM" contact_name="Person2" />
  <sms address="123" type="2" body="Some text" readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
  <sms address="12345678" type="1" body="Some text" readable_date="3/09/2011 11:21:16 AM" contact_name="Person3" />
  <sms address="12345678" type="2" body="Some text" readable_date="3/09/2011 11:37:21 AM" contact_name="Person3" />

  <sms address="12345" type="2" body="Some text" readable_date="28/01/2011 7:24:50 PM" contact_name="(Unknown)" />
  <sms address="233" type="1" body="Some text" readable_date="30/12/2010 1:13:41 PM" contact_name="(Unknown)" />
</smses>

I am trying to get an output like this (e.g. XML):

<sms contact_name="person1">
    <message type="1">{@body}</message>
    <message type="2">{@body}</message>
    <message type="1">{@body}</message>
</sms>
<sms contact_name="person2">
    <message type="2">{@body}</message>
    <message type="1">{@body}</message>
</sms>
<sms contact_name="person3">
    <message type="2">{@body}</message>
    <message type="1">{@body}</message>
</sms>
<sms contact_name="(Unknown)">
    <message type="2">{@body}</message>
    <message type="1">{@body}</message>
</sms>
<sms contact_name="(Unknown)">
    <message type="2">{@body}</message>   
</sms>

e.g. html

<div>
  <h1>Person: @contact_name (@address)</h1>
  <p>message @type: @body</p>
</div>

I have managed to do this with the following XSLT code (please excuse that the code below does not reflect the HTML exactly; the output is the desired result):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" />
    <xsl:key name="txt" match="sms" use="@contact_name" />
    <xsl:template match="smses">
        <xsl:apply-templates select="sms[generate-id(.)=generate-id(key('txt', @contact_name)[1])]">
            <xsl:sort select="@address" order="ascending" />
        </xsl:apply-templates>
    </xsl:template>
    <xsl:template match="sms">
        <h4><xsl:value-of select="@contact_name"  /></h4>
            <xsl:for-each select="key('txt', @contact_name)">
                    <br />
                    <xsl:value-of select="@body" />
            </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

My question is this: I have sms elements whose @contact_name is "(Unknown)" but whose @address values differ, i.e. they should not be grouped together, because the messages came from different numbers/people (the shared contact name is irrelevant). Should I reorder or change the XML data, or is there a way to make XSLT split a group when the @contact_name is the same but the @address is different?

Edit:

I failed to mention (or rather forgot) that, while there are some sms messages with the same @contact_name and different @address values, there are also cases where the @address fields differ only slightly because one of them lacks the country code in front of the number, e.g.

<sms contact_name="jared" address="12345" />
<sms contact_name="jared" address="+64112345" />

But they are meant to be grouped because they are from the same person/number.

Edit:

In my situation the only discrepancies would be a 3-character country code (e.g. +64) plus a 2-digit network code (e.g. 21). Basically, if the @contact_name is the same and the @address is completely different, i.e.

 <sms contact_name="jared" address="12345" />
 <sms contact_name="jared" address="5433467" />

then they should be separate elements, as they are from different people/numbers.

If the @contact_name is the same and the @address differs only by the country and network codes, i.e.

 <sms contact_name="jared" address="02112345" />
 <sms contact_name="jared" address="+642112345" />

then they should be grouped, as they are from the same person/number.

Edit:

country codes: +64 (3 characters)

network codes: 021 (3 characters, usually last character changes depending on network)

Numbers (@address) get saved per <sms> either as +64-21-12345 (excluding the dashes) or as 021-12345 (excluding the dash).

Jared, asked Sep 14 '11 01:09


1 Answer

This transformation uses Muenchian grouping with composite keys:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kContactByNameAddress" match="sms"
          use="concat(@contact_name,'+',@address)"/>

 <xsl:template match=
    "sms[generate-id()
        =
         generate-id(key('kContactByNameAddress',
                         concat(@contact_name,'+',@address)
                        )
                         [1]
                     )
        ]
    ">
     <sms contact_name="{@contact_name}">
       <xsl:apply-templates mode="inGroup"
       select="key('kContactByNameAddress',
                 concat(@contact_name,'+',@address)
                )"/>
     </sms>
 </xsl:template>

 <xsl:template match="sms" mode="inGroup">
       <message type="{@type}">
         <xsl:value-of select="@body"/>
       </message>
 </xsl:template>
 <xsl:template match="sms"/>
</xsl:stylesheet>

When applied to the provided XML document:

<smses>
    <sms address="87654321" type="1" body="Some text"
    readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
    <sms address="87654321" type="2" body="Some text"
    readable_date="3/09/2011 2:36:41 PM" contact_name="Person1" />
    <sms address="87654321" type="1" body="Some text"
    readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
    <sms address="123" type="2" body="Some text"
    readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
    <sms address="123" type="1" body="Some text"
    readable_date="3/09/2011 10:57:52 AM" contact_name="Person2" />
    <sms address="123" type="2" body="Some text"
    readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
    <sms address="12345678" type="1" body="Some text"
    readable_date="3/09/2011 11:21:16 AM" contact_name="Person3" />
    <sms address="12345678" type="2" body="Some text"
    readable_date="3/09/2011 11:37:21 AM" contact_name="Person3" />
    <sms address="12345" type="2" body="Some text"
    readable_date="28/01/2011 7:24:50 PM" contact_name="(Unknown)" />
    <sms address="233" type="1" body="Some text"
    readable_date="30/12/2010 1:13:41 PM" contact_name="(Unknown)" />
</smses>

the wanted, correct result is produced:

<sms contact_name="Person1">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
   <message type="1">Some text</message>
</sms>
<sms contact_name="Person2">
   <message type="2">Some text</message>
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="Person3">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="2">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="1">Some text</message>
</sms>

Update: The OP has edited the question and posted a new requirement: the address attribute may or may not start with a country code. Two addresses, one with a country code and the other without, are "the same" if the substring after the country code is equal to the other address. In this case the two elements should be grouped together.

Here is the solution (it would be trivial to write in XSLT 2.0, but doing it in a single pass in XSLT 1.0 is quite tricky. A multipass solution is easier, but it would generally require the xxx:node-set() extension function and would thus lose portability):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kContactByNameAddress" match="sms"
  use="concat(@contact_name,'+',
              concat(substring(@address,
                               4 div starts-with(@address,'+')),
                     substring(@address,
                               1 div not(starts-with(@address,'+'))
                              )
                     )
              )"/>

 <xsl:template match=
    "sms[generate-id()
        =
         generate-id(key('kContactByNameAddress',
                         concat(@contact_name,'+',
                                concat(substring(@address,
                                                 4 div starts-with(@address,'+')),
                                       substring(@address,
                                                 1 div not(starts-with(@address,'+'))
                                                 )
                                       )
                                 )
                         )
                         [1]
                     )
        ]
    ">
     <sms contact_name="{@contact_name}">
       <xsl:apply-templates mode="inGroup"
       select="key('kContactByNameAddress',
                 concat(@contact_name,'+',
                        concat(substring(@address,
                                         4 div starts-with(@address,'+')),
                               substring(@address,
                                         1 div not(starts-with(@address,'+'))
                                         )
                                )
                        )
                  )
      "/>
     </sms>
 </xsl:template>

 <xsl:template match="sms" mode="inGroup">
       <message type="{@type}">
         <xsl:value-of select="@body"/>
       </message>
 </xsl:template>
 <xsl:template match="sms"/>
</xsl:stylesheet>

When this transformation is applied to the following XML document (the previous one plus three added sms elements with contact_name="jared", two of which have "identical" addresses according to the newly posted rules):

<smses>
    <sms address="87654321" type="1" body="Some text"
        readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
    <sms address="87654321" type="2" body="Some text"
        readable_date="3/09/2011 2:36:41 PM" contact_name="Person1" />
    <sms address="87654321" type="1" body="Some text"
        readable_date="3/09/2011 2:16:52 PM" contact_name="Person1" />
    <sms address="123" type="2" body="Some text"
        readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
    <sms address="123" type="1" body="Some text"
        readable_date="3/09/2011 10:57:52 AM" contact_name="Person2" />
    <sms address="123" type="2" body="Some text"
        readable_date="3/09/2011 10:56:24 AM" contact_name="Person2" />
    <sms address="12345678" type="1" body="Some text"
        readable_date="3/09/2011 11:21:16 AM" contact_name="Person3" />
  <sms contact_name="jared" address="12345" type="2" body="Some text"/>
  <sms contact_name="jared" address="56789" type="1" body="Some text"/>
  <sms contact_name="jared" address="+6412345" type="2" body="Some text"/>
    <sms address="12345678" type="2" body="Some text"
        readable_date="3/09/2011 11:37:21 AM" contact_name="Person3" />
    <sms address="12345" type="2" body="Some text"
        readable_date="28/01/2011 7:24:50 PM" contact_name="(Unknown)" />
    <sms address="233" type="1" body="Some text"
        readable_date="30/12/2010 1:13:41 PM" contact_name="(Unknown)" />
</smses>

the wanted, correct result is produced:

<sms contact_name="Person1">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
   <message type="1">Some text</message>
</sms>
<sms contact_name="Person2">
   <message type="2">Some text</message>
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="Person3">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="jared">
   <message type="2">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="jared">
   <message type="1">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="2">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="1">Some text</message>
</sms>

Detailed explanation:

The main difficulty in this problem arises from the fact that XPath 1.0 has no "if ... then ... else" operator, yet we must specify a single XPath expression in the use attribute of the xsl:key instruction that selects either the address attribute (when it doesn't start with "+") or its substring after the country code (when its string value starts with "+").

Here I am using this poor man's implementation of

if($condition)
  then $string1
  else $string2

The following XPath expression, when evaluated, is equivalent to the above:

concat(substring($string1, 1 div $condition),
       substring($string2, 1 div not($condition))
      )

This equivalence follows from the fact that 1 div true() is the same as 1 div 1 and this is 1, while 1 div false() is the same as 1 div 0 and that is the number (positive) Infinity.

Also, for any string $s, the value of substring($s, Infinity) is just the empty string. And, of course, for any string $s the value of substring($s, 1) is just the string $s itself.
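To see the idiom in isolation, here is a minimal sketch (not part of the solution above; the variable name vHasPlus is made up for illustration) that prints every address next to its normalized form. It assumes, as the key above does, that the country code is always the first three characters of an address that starts with "+". For the addresses 12345 and +6412345 it should print 12345 as the normalized form in both cases.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:template match="/">
  <xsl:for-each select="//sms">
   <!-- The condition: true() if the address carries a country code -->
   <xsl:variable name="vHasPlus" select="starts-with(@address,'+')"/>

   <xsl:value-of select="@address"/>
   <xsl:text>: </xsl:text>

   <!-- if ($vHasPlus) then substring(@address, 4) else @address.
        The branch whose start position evaluates to Infinity
        contributes the empty string. -->
   <xsl:value-of select=
    "concat(substring(@address, 4 div $vHasPlus),
            substring(@address, 1 div not($vHasPlus))
           )"/>
   <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>
```

Note that the first branch uses 4 div $condition rather than 1 div $condition: when the condition is true the start position is 4 (skipping the three characters "+64"), and when it is false the start position is still Infinity.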

II. XSLT 2.0 solution:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/*">
  <xsl:for-each-group select="sms" group-by=
   "concat(@contact_name,'+',
           if(starts-with(@address,'+'))
             then substring(@address, 4)
             else @address
           )">
     <sms contact_name="{@contact_name}">
      <xsl:apply-templates select="current-group()"/>
     </sms>

  </xsl:for-each-group>
 </xsl:template>

 <xsl:template match="sms">
       <message type="{@type}">
         <xsl:value-of select="@body"/>
       </message>
 </xsl:template>
</xsl:stylesheet>

When this (much simpler!) XSLT 2.0 transformation is applied to the same XML document (above), the same correct output is produced:

<sms contact_name="Person1">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
   <message type="1">Some text</message>
</sms>
<sms contact_name="Person2">
   <message type="2">Some text</message>
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="Person3">
   <message type="1">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="jared">
   <message type="2">Some text</message>
   <message type="2">Some text</message>
</sms>
<sms contact_name="jared">
   <message type="1">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="2">Some text</message>
</sms>
<sms contact_name="(Unknown)">
   <message type="1">Some text</message>
</sms>
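
A caveat regarding the question's last edit: there the two forms of the same number are 02112345 and +642112345. Dropping only the first three characters of the second one leaves 2112345, which is not equal to 02112345, so the keys above would put these two elements into different groups. A possible XSLT 1.0 extension is sketched below. It rests on assumptions that the question does not fully confirm: an international address is always "+64" followed by the number without its leading "0", and a national address always starts with "0". The key name kByNameNormAddress and the variable name vGroup are made up for this sketch.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Normalized address (assumed rules):
        starts with '+'  : drop the 3-character country code
        starts with '0'  : drop the leading zero
        anything else    : keep the address unchanged
      At most one of the three substring() calls is non-empty. -->
 <xsl:key name="kByNameNormAddress" match="sms"
  use="concat(@contact_name, '+',
              substring(@address, 4 div starts-with(@address,'+')),
              substring(@address, 2 div starts-with(@address,'0')),
              substring(@address,
                        1 div not(starts-with(@address,'+')
                                  or starts-with(@address,'0')))
             )"/>

 <xsl:template match="/*">
  <xsl:for-each select="sms">
   <!-- All sms elements that share this element's key value -->
   <xsl:variable name="vGroup" select=
    "key('kByNameNormAddress',
         concat(@contact_name, '+',
                substring(@address, 4 div starts-with(@address,'+')),
                substring(@address, 2 div starts-with(@address,'0')),
                substring(@address,
                          1 div not(starts-with(@address,'+')
                                    or starts-with(@address,'0')))
               )
        )"/>

   <!-- Muenchian test: emit a group only at its first member -->
   <xsl:if test="generate-id() = generate-id($vGroup[1])">
    <sms contact_name="{@contact_name}">
     <xsl:for-each select="$vGroup">
      <message type="{@type}">
       <xsl:value-of select="@body"/>
      </message>
     </xsl:for-each>
    </sms>
   </xsl:if>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>
```

With these rules both 02112345 and +642112345 normalize to 2112345 and therefore land in the same group, while an address such as 5433467 is left unchanged. Country codes of a different length are not handled by this sketch.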
Dimitre Novatchev, answered Sep 23 '22 09:09