XPath to recursively remove empty DOM nodes?

Tags: dom, php, xpath

I am trying to find a way to clean up a bunch of empty DOM elements from an HTML source like this:

<div class="empty">
    <div>&nbsp;</div>
    <div></div>
</div>
<a href="http://example.com">good</a>
<div>
    <p></p>
</div>
<br>
<img src="http://example.com/logo.png" />
<div></div>

However, I don't want to harm valid elements or line breaks. So the result should be something like this:

<a href="http://example.com">good</a>
<br>
<img src="http://example.com/logo.png" />

So far I have tried some XPaths like this:

$xpath = new DOMXPath($dom);

//$x = '//*[not(*) and not(normalize-space(.))]';
//$x = '//*[not(text() or node() or self::br)]';
//$x = 'not(normalize-space(.) or self::br)';
$x = '//*[not(text() or node() or self::br)]';

while(($nodeList = $xpath->query($x)) && $nodeList->length > 0) {
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }
}

Can someone show me the correct XPath to remove empty DOM nodes that serve no purpose if empty? (img, br, and input serve a purpose even if empty)

Current output:

<div>
    <div>&nbsp;</div>

</div>
<a href="http://example.com">good</a>
<div>

</div>
<br>

Update

To clarify, I am looking for an XPath query that is either:

  • Recursive in matching empty nodes until all are found (including parents of empty nodes)
  • Can be successfully run multiple times after each cleanup (as shown in my example)
Xeoncross asked Dec 27 '22 20:12


1 Answer

I. Initial solution:

XPath is a query language for XML documents. As such, the evaluation of an XPath expression only selects nodes or extracts non-node information from the XML document, but never alters the XML document. Thus evaluating an XPath expression never deletes or inserts nodes -- the XML document remains the same.

What you want, "to cleanup a bunch of empty DOM elements from an HTML source", cannot be done with XPath alone.

This is confirmed by the most credible and the only official (we say normative) source on XPath -- the W3C XPath 1.0 Recommendation:

"The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. "

Therefore, some additional language must be used in conjunction with XPath in order to implement the required functionality.

XSLT is a language especially designed for XML transformation.

Here is an XSLT-based example -- a short and simple XSLT transformation that performs the requested cleanup:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
 "*[not(string(translate(., '&#xA0;', '')))
  and
    not(descendant-or-self::*
          [self::img or self::input or self::br])]"/>
</xsl:stylesheet>

When applied to the provided XML (corrected to become a well-formed XML document):

<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>

the wanted, correct result is produced:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

Explanation:

  1. The identity rule copies "as-is" every node for which it is selected for execution.

  2. There is a single template that overrides the identity template for any element that neither is nor contains an img, input or br element and whose string value, once every &nbsp; character has been removed, is the empty string. (Whitespace-only text nodes have already been dropped by xsl:strip-space.) The body of this template is empty, which effectively "deletes" the matched element -- the matched element isn't copied to the output.
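Since the question uses PHP, the transformation above could be run roughly as follows. This is a minimal sketch, assuming PHP's xsl extension (which provides XSLTProcessor) is installed; the variable names are illustrative and not part of the original answer.

```php
<?php
// Sketch only: assumes ext-xsl (XSLTProcessor) and ext-dom are available.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
 "*[not(string(translate(., '&#xA0;', '')))
  and
    not(descendant-or-self::*
          [self::img or self::input or self::br])]"/>
</xsl:stylesheet>
XSL;

$source = <<<'XML'
<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>
XML;

// Load the stylesheet and the source as DOM documents.
$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);
$xml = new DOMDocument();
$xml->loadXML($source);

// Compile the stylesheet and run the transformation.
$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);
$result = $processor->transformToXml($xml);

echo $result;
```

For real-world HTML that is not well-formed XML, the source would first have to be parsed with an HTML-aware loader; that step is outside this sketch.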


II. Update:

The OP clarifies that he wants one or more XPath expressions that:

"Can be successfully run multiple times after each cleanup."

Interestingly enough, there exists a single XPath expression that selects exactly all nodes that need to be deleted -- therefore "multiple cleanups" are completely avoided:

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
   "//*[not(normalize-space((translate(., '&#xA0;', ''))))
      and
        not(descendant-or-self::*[self::img or self::input or self::br])
       ]
        [not(ancestor::*
               [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                           and
                             not(descendant-or-self::*
                                    [self::img or self::input or self::br])
                             ]
                      )
               =
                count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                        and
                          not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                          ]
                      )
               ]
            )
        ]
 "/>
</xsl:stylesheet>

When this transformation is applied to the provided (and made well-formed) XML document (above), all nodes are copied "as-is" with the exception of the nodes selected by our XPath expression:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

Explanation:

Let us denote with $vAllEmpty all the nodes that are "empty" according to the definition of "empty" in the question.

$vAllEmpty is expressed with the following XPath expression:

   //*[not(normalize-space((translate(., '&#xA0;', ''))))
     and
       not(descendant-or-self::*
             [self::img or self::input or self::br])
      ]

For all of these to be deleted, we need to delete just the "top nodes" from $vAllEmpty, because deleting a node also removes all of its descendants.

Let us denote the set of all such "top nodes" as: $vTopEmpty.

$vTopEmpty can be expressed from $vAllEmpty using the following XPath 2.0 expression:

$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]

This selects those nodes from $vAllEmpty that don't have any ancestor element that is also in $vAllEmpty.

The last XPath expression has its equivalent XPath 1.0 expression:

$vAllEmpty[not(ancestor::*[count(.|$vAllEmpty) = count($vAllEmpty)])]
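The XPath 1.0 form relies on a set-membership test: a node belongs to a node-set exactly when adding it to the set by union leaves the set's size unchanged, i.e. count(.|$set) = count($set). A small PHP sketch of that test on its own, using a made-up toy document and a hypothetical set '//a/b' rather than anything from the question:

```php
<?php
// Toy document (hypothetical): two <b> elements, only one of them under <a>.
$dom = new DOMDocument();
$dom->loadXML('<r><a><b id="1"/></a><c><b id="2"/></c></r>');
$xpath = new DOMXPath($dom);

// The node-set we want to test membership in.
$set = '//a/b';

// Select those <b> elements that are members of $set:
// the union with "." does not grow the set only for members.
$inSet = $xpath->query("//b[count(. | $set) = count($set)]");

foreach ($inSet as $node) {
    echo $node->getAttribute('id'), "\n";
}
```

Only the b element under a passes the test, because for the other b the union contains one more node than the set itself.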

Now we replace $vAllEmpty in the XPath 1.0 expression with its expanded XPath expression as defined above, and this is how we obtain the final expression, which selects only the "top nodes to delete":

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]
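In PHP, this expression can be handed to DOMXPath and the selected nodes removed in a single pass. The following is a minimal sketch, not the answer's own code. One thing to note: '&#xA0;' is an XML character reference that the XML parser resolves when the expression is written inside a stylesheet; XPath itself has no such escape, so in a PHP string the actual U+00A0 character must be supplied (here with the "\u{00A0}" escape, which assumes PHP 7 or later).

```php
<?php
$source = <<<'XML'
<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($source);
$xpath = new DOMXPath($dom);

// The literal no-break space character (PHP 7+ escape syntax).
$nbsp = "\u{00A0}";

// All "empty" elements ($vAllEmpty in the explanation above).
$empty = "//*[not(normalize-space(translate(., '$nbsp', '')))"
       . " and not(descendant-or-self::*[self::img or self::input or self::br])]";

// Only those empty elements that have no empty ancestor ($vTopEmpty).
$top = $empty . "[not(ancestor::*[count(. | $empty) = count($empty)])]";

// Take a snapshot of the result before mutating the tree.
$nodes = iterator_to_array($xpath->query($top));
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}

echo $dom->saveXML($dom->documentElement), "\n";
```

Because only the top nodes are selected, no while loop around the query is needed. Whitespace-only text nodes between the remaining elements are left in place by this sketch.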

Short XSLT 2.0-based verification using variables:

<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
     <xsl:strip-space elements="*"/>

     <xsl:variable name="vAllEmpty" select=
      "//*[not(normalize-space((translate(., '&#xA0;', ''))))
         and
           not(descendant-or-self::*
                 [self::img or self::input or self::br])
          ]"/>

     <xsl:variable name="vTopEmpty" select=
     "$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]"/>

     <xsl:template match="node()|@*">
      <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
     </xsl:template>

     <xsl:template match="*[. intersect $vTopEmpty]"/>
</xsl:stylesheet>

This transformation copies every node "as-is" with the exception of any node that belongs to $vTopEmpty. The result is the correct and expected one:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

III. Alternative solution (may require "multiple cleanups"):

An alternative approach is not to attempt to specify the nodes to delete, but to specify the nodes to keep -- then the nodes to delete are the set difference between all nodes and the nodes to keep.

The nodes to keep are selected by this XPath expression:

  //node()
    [self::input or self::img or self::br
    or
     self::text()[normalize-space(translate(.,'&#xA0;',''))]
    ]
     /ancestor-or-self::node()

Then the nodes to delete are:

  //node()
     [not(count(.
              |
                //node() 
                   [self::input or self::img or self::br
                  or
                    self::text()[normalize-space(translate(.,'&#xA0;',''))]
                   ]
                    /ancestor-or-self::node()
                )
        =
         count(//node()
                  [self::input or self::img or self::br
                 or
                   self::text()[normalize-space(translate(.,'&#xA0;',''))]
                  ]
                   /ancestor-or-self::node()
               )
         )
     ]

However, do note that these are all nodes to delete and not only the "top nodes to delete". It is possible to express only the "top nodes to delete", but the resulting expression is rather complicated. If one attempts to delete every selected node, errors may occur, because the descendants of a "top node to delete" follow it in document order and are therefore processed after their ancestor has already been removed.
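One way to sidestep the complicated expression is to evaluate only the "nodes to keep" query in XPath and do the "top node" filtering in the host language. The sketch below is an assumption-laden illustration, not part of the original answer: it identifies nodes by the string returned from DOMNode::getNodePath(), collects the nodes to delete before changing the tree (paths shift once siblings are removed), and uses a compact version of the sample document.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<html><div class="empty"><div>&#xA0;</div><div></div></div>'
    . '<a href="http://example.com">good</a><div> <p></p> </div><br/>'
    . '<img src="http://example.com/logo.png"/><div></div></html>'
);
$xpath = new DOMXPath($dom);

// The literal no-break space character (PHP 7+ escape syntax).
$nbsp = "\u{00A0}";

// The "nodes to keep" expression from above.
$keepQuery = "//node()[self::input or self::img or self::br or "
    . "self::text()[normalize-space(translate(., '$nbsp', ''))]]"
    . "/ancestor-or-self::node()";

// Build a lookup of kept nodes, keyed by their node path.
$keep = [];
foreach ($xpath->query($keepQuery) as $node) {
    $keep[$node->getNodePath()] = true;
}

// A "top node to delete" is not kept itself, but its parent is kept.
$toDelete = [];
foreach ($xpath->query('//node()') as $node) {
    $isKept = isset($keep[$node->getNodePath()]);
    $parentKept = isset($keep[$node->parentNode->getNodePath()]);
    if (!$isKept && $parentKept) {
        $toDelete[] = $node;
    }
}

// Remove only the top nodes; their descendants go with them.
foreach ($toDelete as $node) {
    $node->parentNode->removeChild($node);
}

echo $dom->saveXML($dom->documentElement), "\n";
```

Unlike the element-only expressions earlier, this variant follows the //node() formulation, so whitespace-only text nodes whose parent is kept would be removed as well.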

Dimitre Novatchev answered Dec 31 '22 13:12