
How to flatten an XML file into a set of xpath expressions?

Tags: java, xml, xpath

Suppose I have the following example XML file:

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
   <article xmlns:ns1='http://predic8.com/material/1/'>
      <name xmlns:ns1='http://predic8.com/material/1/'>foo</name>
      <description xmlns:ns1='http://predic8.com/material/1/'>bar</description>
      <price xmlns:ns1='http://predic8.com/common/1/'>
         <amount xmlns:ns1='http://predic8.com/common/1/'>00.00</amount>
         <currency xmlns:ns1='http://predic8.com/common/1/'>USD</currency>
      </price>
      <id xmlns:ns1='http://predic8.com/material/1/'>1</id>
   </article>
</ns1:create>

What would be the best (most efficient) way to flatten this into a set of XPath expressions? Note that I want to ignore any namespace and attribute information. (If needed, this could also be done as a pre-processing step.)

So I want to get the following output:

/create/article/name
/create/article/description
/create/article/price/amount
/create/article/price/currency
/create/article/id

I’m implementing in Java.

EDIT: PS, I might also need this to work when the elements contain no text. For example, the following should generate the same output as above:

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
  <article xmlns:ns1='http://predic8.com/material/1/'>
    <name />
    <description />
    <price xmlns:ns1='http://predic8.com/common/1/'>
      <amount />
      <currency xmlns:ns1='http://predic8.com/common/1/'></currency>
    </price>
    <id xmlns:ns1='http://predic8.com/material/1/'></id>
  </article>
</ns1:create>

Larry asked Oct 23 '22 17:10


1 Answer

You could do this pretty easily with XSLT. Looking at your examples, it seems like you only want the XPath of elements that contain text. If that's not the case, let me know and I can update the XSLT.

I created a new input example to show how it handles siblings with the same name (in this case, <article>).

XML Input

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
    <article xmlns:ns1='http://predic8.com/material/1/'>
        <name xmlns:ns1='http://predic8.com/material/1/'>foo</name>
        <description xmlns:ns1='http://predic8.com/material/1/'>bar</description>
        <price xmlns:ns1='http://predic8.com/common/1/'>
            <amount xmlns:ns1='http://predic8.com/common/1/'>00.00</amount>
            <currency xmlns:ns1='http://predic8.com/common/1/'>USD</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/1/'>1</id>
    </article>
    <article xmlns:ns1='http://predic8.com/material/2/'>
        <name xmlns:ns1='http://predic8.com/material/2/'>some name</name>
        <description xmlns:ns1='http://predic8.com/material/2/'>some description</description>
        <price xmlns:ns1='http://predic8.com/common/2/'>
            <amount xmlns:ns1='http://predic8.com/common/2/'>00.01</amount>
            <currency xmlns:ns1='http://predic8.com/common/2/'>USD</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/2/'>2</id>
    </article>
</ns1:create>

XSLT 1.0

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="text()"/>

    <xsl:template match="*[text()]">
        <xsl:call-template name="genPath"/>
        <xsl:apply-templates select="node()"/>
    </xsl:template>

    <xsl:template name="genPath">
        <xsl:param name="prevPath"/>
        <xsl:variable name="currPath" select="concat('/',local-name(),'[',
        count(preceding-sibling::*[local-name() = local-name(current())])+1,']',$prevPath)"/>
        <xsl:for-each select="parent::*">
            <xsl:call-template name="genPath">
                <xsl:with-param name="prevPath" select="$currPath"/>
            </xsl:call-template>
        </xsl:for-each>
        <xsl:if test="not(parent::*)">
            <xsl:value-of select="$currPath"/>
            <xsl:text>&#xA;</xsl:text>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>

Output

/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]
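
The question asks for paths without positional predicates, and with each distinct path listed once. Since the asker is working in Java, one option is to post-process the stylesheet output. The following is a minimal sketch under that assumption; the class and method names (`StripPositions`, `stripPositions`) are hypothetical and not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class StripPositions {

    /** Removes positional predicates such as [1] and drops duplicates, keeping first-seen order. */
    public static List<String> stripPositions(List<String> indexedPaths) {
        Set<String> unique = new LinkedHashSet<>();
        for (String path : indexedPaths) {
            // "\\[\\d+\\]" matches one positional predicate, e.g. "[2]"
            unique.add(path.replaceAll("\\[\\d+\\]", ""));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "/create[1]/article[1]/name[1]",
                "/create[1]/article[1]/price[1]/amount[1]",
                "/create[1]/article[2]/name[1]",
                "/create[1]/article[2]/price[1]/amount[1]");
        // Prints /create/article/name and /create/article/price/amount, one per line
        stripPositions(indexed).forEach(System.out::println);
    }
}
```

A `LinkedHashSet` is used so the paths stay in document order after deduplication.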

UPDATE

For the XSLT to work for all elements, simply remove the [text()] predicate from match="*[text()]"; this outputs the path for every element. If you don't want a path for elements that contain other elements (like create, article, and price), use the predicate [not(*)] instead, so the match becomes *[not(*)]. Here's an updated example:

New XML Input

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
    <article xmlns:ns1='http://predic8.com/material/1/'>
        <name />
        <description />
        <price xmlns:ns1='http://predic8.com/common/1/'>
            <amount />
            <currency xmlns:ns1='http://predic8.com/common/1/'></currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/1/'></id>
    </article>
    <article xmlns:ns1='http://predic8.com/material/2/'>
        <name xmlns:ns1='http://predic8.com/material/2/'>some name</name>
        <description xmlns:ns1='http://predic8.com/material/2/'>some description</description>
        <price xmlns:ns1='http://predic8.com/common/2/'>
            <amount xmlns:ns1='http://predic8.com/common/2/'>00.01</amount>
            <currency xmlns:ns1='http://predic8.com/common/2/'>USD</currency>
        </price>
        <id xmlns:ns1='http://predic8.com/material/2/'>2</id>
    </article>
</ns1:create>

XSLT 1.0

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="text()"/>

    <xsl:template match="*[not(*)]">
        <xsl:call-template name="genPath"/>
        <xsl:apply-templates select="node()"/>
    </xsl:template>

    <xsl:template name="genPath">
        <xsl:param name="prevPath"/>
        <xsl:variable name="currPath" select="concat('/',local-name(),'[',
            count(preceding-sibling::*[local-name() = local-name(current())])+1,']',$prevPath)"/>
        <xsl:for-each select="parent::*">
            <xsl:call-template name="genPath">
                <xsl:with-param name="prevPath" select="$currPath"/>
            </xsl:call-template>
        </xsl:for-each>
        <xsl:if test="not(parent::*)">
            <xsl:value-of select="$currPath"/>
            <xsl:text>&#xA;</xsl:text>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>

Output

/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]
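
If you would rather stay in plain Java, the same idea as the `*[not(*)]` match can be sketched with a recursive DOM walk: emit a path for every element that has no element children, and compute the position by counting preceding siblings with the same local name. This is only an illustration, not a tuned implementation; `LeafPaths` and its methods are hypothetical names, and the sample XML in `main` is a shortened variant of the input above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LeafPaths {

    /** Returns one indexed path per element that has no element children, in document order. */
    public static List<String> leafPaths(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace awareness is needed so getLocalName() returns the prefix-free name
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        List<String> paths = new ArrayList<>();
        collect(doc.getDocumentElement(), "", paths);
        return paths;
    }

    private static void collect(Element element, String parentPath, List<String> paths) {
        String path = parentPath + "/" + element.getLocalName() + "[" + position(element) + "]";
        boolean hasElementChild = false;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect((Element) child, path, paths);
            }
        }
        if (!hasElementChild) {
            paths.add(path); // leaf element: text-only or empty
        }
    }

    /** 1-based position among sibling elements that share the same local name. */
    private static int position(Element element) {
        int position = 1;
        for (Node s = element.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE
                    && element.getLocalName().equals(s.getLocalName())) {
                position++;
            }
        }
        return position;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>"
                + "<article><name/><price><amount/><currency></currency></price></article>"
                + "<article><name>some name</name><id>2</id></article>"
                + "</ns1:create>";
        leafPaths(xml).forEach(System.out::println);
    }
}
```

Attributes are never visited, and empty elements are treated the same as elements with text. The sibling scan in `position` is quadratic for very wide elements; a per-parent counter map would avoid that if it matters.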

If you remove the [not(*)] predicate, this is what the output looks like (a path is output for every element):

/create[1]
/create[1]/article[1]
/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]

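Here's another version of the XSLT that is about 65% faster; it builds each path with a single for-each over ancestor-or-self::* instead of recursive named-template calls: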

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="text()"/>

    <xsl:template match="*[not(*)]">
        <xsl:for-each select="ancestor-or-self::*">
            <xsl:value-of select="concat('/',local-name(),'[',count(preceding-sibling::*[local-name()=local-name(current())])+1,']')"/>
        </xsl:for-each>
        <xsl:text>&#xA;</xsl:text>
        <xsl:apply-templates select="node()"/>
    </xsl:template>

</xsl:stylesheet>
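
Since the question mentions Java, here is a rough sketch of how a stylesheet like the one above could be applied with the JDK's built-in JAXP API (`javax.xml.transform`). The stylesheet is inlined as a string only to keep the example self-contained; `RunStylesheet` and `transform` are hypothetical names, and the sample XML in `main` is a shortened variant of the input above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RunStylesheet {

    // The compact stylesheet from the answer, inlined so the example is self-contained
    public static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:strip-space elements='*'/>"
          + "<xsl:template match='text()'/>"
          + "<xsl:template match='*[not(*)]'>"
          + "<xsl:for-each select='ancestor-or-self::*'>"
          + "<xsl:value-of select=\"concat('/',local-name(),'[',"
          + "count(preceding-sibling::*[local-name()=local-name(current())])+1,']')\"/>"
          + "</xsl:for-each>"
          + "<xsl:text>&#xA;</xsl:text>"
          + "<xsl:apply-templates select='node()'/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the given stylesheet to the given XML and returns the text output. */
    public static String transform(String xslt, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>"
                + "<article><name>foo</name><id>1</id></article>"
                + "<article><name/><id/></article>"
                + "</ns1:create>";
        System.out.print(transform(XSLT, xml));
    }
}
```

If the same stylesheet is applied to many documents, compiling it once with `TransformerFactory.newTemplates` and creating a `Transformer` per run from the resulting `Templates` object avoids re-parsing the XSLT each time.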

Daniel Haley answered Nov 08 '22 05:11