
Exclude certain child nodes when data structure is unknown

EDIT - I've figured out the solution to my problem and posted a Q&A here.

I'm looking to process XML conforming to the Library of Congress EAD standard (found here). Unfortunately, the standard is very loose regarding the structure of the XML.

For example, the <bioghist> tag can exist within the <archdesc> tag, within a <descgrp> tag, nested within another <bioghist> tag, or in any combination of these, or it can be left out entirely. I've found it very difficult to select just the <bioghist> tag I'm looking for without also selecting others.

Below are a few different possible EAD XML documents my XSLT might have to process:

First example

<ead>
<eadheader>
    <archdesc>
        <bioghist>one</bioghist>
        <dsc>
            <c01>
                <descgrp>
                    <bioghist>two</bioghist>
                </descgrp>
                <c02>
                    <descgrp>
                        <bioghist>
                            <bioghist>three</bioghist>
                        </bioghist>
                    </descgrp>
                </c02>
            </c01>
        </dsc>
    </archdesc>
</eadheader>
</ead>

Second example

<ead>
<eadheader>
    <archdesc>
        <descgrp>
            <bioghist>
                <bioghist>one</bioghist>
            </bioghist>
        </descgrp>
        <dsc>
            <c01>
                <c02>
                    <descgrp>
                        <bioghist>three</bioghist>
                    </descgrp>
                </c02>
                <bioghist>two</bioghist>
            </c01>
        </dsc>
    </archdesc>
</eadheader>
</ead>

Third example

<ead>
<eadheader>
    <archdesc>
        <descgrp>
            <bioghist>one</bioghist>
        </descgrp>
        <dsc>
            <c01>
                <c02>
                    <bioghist>three</bioghist>
                </c02>
            </c01>
        </dsc>
    </archdesc>
</eadheader>
</ead>

As you can see, an EAD XML file might have a <bioghist> tag almost anywhere. The actual output I'm supposed to produce is too complicated to post here. A simplified example of the output for the above three EAD examples might look like this:

Output for First example

<records>
<primary_record>
    <biography_history>first</biography_history>
</primary_record>
<child_record>
    <biography_history>second</biography_history>
</child_record>
<grandchild_record>
    <biography_history>third</biography_history>
</grandchild_record>
</records>

Output for Second example

<records>
<primary_record>
    <biography_history>first</biography_history>
</primary_record>
<child_record>
    <biography_history>second</biography_history>
</child_record>
<grandchild_record>
    <biography_history>third</biography_history>
</grandchild_record>
</records>

Output for Third example

<records>
<primary_record>
    <biography_history>first</biography_history>
</primary_record>
<child_record>
    <biography_history></biography_history>
</child_record>
<grandchild_record>
    <biography_history>third</biography_history>
</grandchild_record>
</records>

If I want to pull the "first" bioghist value and put it in the <primary_record>, I can't simply use <xsl:apply-templates select="/ead/eadheader/archdesc/bioghist"/>, as that tag might not be a direct child of the <archdesc> tag. It might be wrapped in a <descgrp>, in a <bioghist>, or in a combination thereof. And I can't use select="//bioghist", because that will pull all the <bioghist> tags. I can't even use select="//bioghist[1]", because there might not actually be a <bioghist> tag at that level, and then I'd be pulling the value below the <c01>, which is "second" and should be processed later.

This is already a long post, but one other wrinkle is that there can be an unlimited number of <cxx> nodes, nested up to twelve levels deep. I'm currently processing them recursively. I've tried saving the node I'm currently processing (<c01>, for example) as a variable called 'RN', then running <xsl:apply-templates select=".//bioghist [name(..)=name($RN) or name(../..)=name($RN)]"/>. This works for some forms of EAD, where the <bioghist> tag isn't nested too deeply, but it will fail if it ever has to process an EAD file created by someone who loves wrapping tags in other tags (which is totally fine according to the EAD standard).

What I'd love is some way of saying:

  • Get any <bioghist> tag anywhere below the current node but
  • don't dig deeper if you hit a <c??> tag

I hope that I've made the situation clear. Please let me know if I've left anything ambiguous. Any assistance you can provide would be greatly appreciated. Thanks.

aarondev asked Nov 03 '22 21:11


1 Answer

As the requirements are rather vague, any answer only reflects the guesses its author has made.

Here is mine:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:my="my:my" exclude-result-prefixes="my">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <my:names>
  <n>primary_record</n>
  <n>child_record</n>
  <n>grandchild_record</n>
 </my:names>

 <xsl:variable name="vNames" select="document('')/*/my:names/*"/>

 <xsl:template match="/">
  <xsl:apply-templates select=
   "//bioghist[following-sibling::node()[1]
                                [self::descgrp]
              ]"/>
 </xsl:template>

 <xsl:template match="bioghist">
  <xsl:variable name="vPos" select="position()"/>

  <xsl:element name="{$vNames[position() = $vPos]}">
   <xsl:value-of select="."/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<ead>
    <eadheader>
        <archdesc>
            <bioghist>first</bioghist>
            <descgrp>
                <bioghist>first</bioghist>
                <bioghist>
                    <bioghist>first</bioghist></bioghist>
            </descgrp>
            <dsc>
                <c01>
                    <bioghist>second</bioghist>
                    <descgrp>
                        <bioghist>second</bioghist>
                        <bioghist>
                            <bioghist>second</bioghist></bioghist>
                    </descgrp>
                    <c02>
                        <bioghist>third</bioghist>
                        <descgrp>
                            <bioghist>third</bioghist>
                            <bioghist>
                                <bioghist>third</bioghist></bioghist>
                        </descgrp>
                    </c02>
                </c01>
            </dsc>
        </archdesc>
    </eadheader>
</ead>

the wanted result is produced:

<primary_record>first</primary_record>
<child_record>second</child_record>
<grandchild_record>third</grandchild_record>
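
The two rules from the question (take any <bioghist> below the current node, but don't dig deeper on hitting a <c??> tag) can also be expressed with template modes. The following is only a minimal sketch of that idea and is separate from the stylesheet above. The <record> element, its level attribute, and the mode names "bio" and "next" are invented for illustration, and it assumes components are always named c01 through c12.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <records>
   <xsl:apply-templates select="//archdesc"/>
  </records>
 </xsl:template>

 <!-- A record node: archdesc or any numbered component -->
 <xsl:template match="archdesc | c01 | c02 | c03 | c04 | c05 | c06
                      | c07 | c08 | c09 | c10 | c11 | c12">
  <!-- "record" and "level" are hypothetical output names -->
  <record level="{name()}">
   <biography_history>
    <!-- Collect only the bioghist text that belongs to this node -->
    <xsl:apply-templates select="*" mode="bio"/>
   </biography_history>
  </record>
  <!-- Then look for the components nested below this node -->
  <xsl:apply-templates select="*" mode="next"/>
 </xsl:template>

 <!-- mode "bio": descend through wrappers such as descgrp -->
 <xsl:template match="*" mode="bio">
  <xsl:apply-templates select="*" mode="bio"/>
 </xsl:template>

 <!-- mode "bio": stop at a nested component; its bioghist is not ours -->
 <xsl:template match="c01 | c02 | c03 | c04 | c05 | c06
                      | c07 | c08 | c09 | c10 | c11 | c12" mode="bio"/>

 <!-- mode "bio": emit the outermost bioghist, including any nested one -->
 <xsl:template match="bioghist" mode="bio">
  <xsl:value-of select="."/>
 </xsl:template>

 <!-- mode "next": descend until the nearest nested component -->
 <xsl:template match="*" mode="next">
  <xsl:apply-templates select="*" mode="next"/>
 </xsl:template>

 <!-- mode "next": hand the component back to the record template -->
 <xsl:template match="c01 | c02 | c03 | c04 | c05 | c06
                      | c07 | c08 | c09 | c10 | c11 | c12" mode="next">
  <xsl:apply-templates select="."/>
 </xsl:template>
</xsl:stylesheet>
```

Applied to the first example in the question, this should give the archdesc record the value "one", the c01 record "two", and the c02 record "three". In the third example the c01 record gets an empty <biography_history>, because the only <bioghist> below it sits inside <c02>. If a node has several <bioghist> elements of its own, their text is concatenated, so a real stylesheet would probably need a separator or one output element per match.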

Dimitre Novatchev answered Nov 09 '22 08:11