Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get count of distinct XML nodes?

Tags:

php

xml

I'm having trouble using references in recursive calls.

What I am trying to accomplish is to describe an XML document in terms of the max count of distinct nodes within a respective element - without knowing any of the node element names in advance.

Consider this document:

<Data>
    <Record>
        <SAMPLE>
            <TITLE>Superior Title</TITLE>
            <SUBTITLE>Sub Title</SUBTITLE>
            <AUTH>
                <FNAME>John</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <AUTH>
                <FNAME>Jane</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
    </Record>
    <Record>
        <SAMPLE>
            <TITLE>Interesting Title</TITLE>
            <AUTH>
                <FNAME>John</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
        <SAMPLE>
            <TITLE>Another Title</TITLE>
            <AUTH>
                <FNAME>Jane</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
    </Record>
</Data>

You can see that a Record has either 1 or 2 SAMPLE nodes and that the SAMPLE has 1 or 2 AUTH nodes. I'm trying to produce an array that will describe the structure of the document in terms of the max number of distinct nodes inside a respective node.

So I'm trying to get a result like this:

$result = [

  "Data" => [
    "max_count" => 1,
    "elements" => [

      "Record" => [
        "max_count" => 2,
        "elements" => [

          "SAMPLE" => [
            "max_count" => 2,
            "elements" => [

              "TITLE" => [
                "max_count" => 1
              ],
              "SUBTITLE" => [
                "max_count" => 1
              ],
              "AUTH" => [
                "max_count" => 2,
                "elements" => [

                  "FNAME" => [
                    "max_count" => 1
                  ],
                  "DISPLAY" => [
                    "max_count" => 1
                  ]

                ]
              ],
              "ABSTRACT" => [
                "max_count" => 1
              ]

            ]
          ]

        ]
      ]

    ]
  ]

];

To preserve a little bit of my sanity, I'm using sabre/xml to do the legwork parsing the XML.

I can get an absolute count of elements using recursive calls with a reference to the original array.

  private function countArrayElements(&$array, &$result){
    // get collection of subnodes
    foreach ($array as $node){

      $name = $this->stripNamespace($node['name']);

      // get count of distinct subnodes
      if (empty($result[$name])){
        $result[$name]["max_count"] = 1;
      } else {
        $result[$name]["max_count"]++;
      }

      if (is_array($node['value'])){
        $this->countArrayElements($node['value'], $result[$name]["elements"]);
      }

    }
  }

So my reasoning was that I could also pass the array by reference and do a comparison, which works for the top two nodes, but somehow resets on the subsequent nodes which results in a count of only 1 for the AUTH node.

  private function countArrayElements(&$array, &$previous){

    // get collection of subnodes
    foreach ($array as $node){

      $name = $this->stripNamespace($node['name']);

      // get count of distinct subnodes
      if (empty($result[$name]["max_count"])){
        $result[$name]["max_count"] = 1;
      } else {
        $result[$name]["max_count"]++;
      }

      // recurse
      if (is_array($node['value'])){
        $result[$name]["elements"] = $this->countArrayElements(
          $node['value'],
          $result[$name]["elements"]
        );
      }

      // compare previous max
      if (!empty($previous[$name]["max_count"])){
        $result[$name]["max_count"] = max(
          $previous[$name]["max_count"],
          $result[$name]["max_count"]
        );
      }

    }

    return $result;

  }

I realize this is a pretty complex question, of which it is just a small piece of a much larger project, so I have tried to break it down as much as possible for this MCVE and I have additionally prepared a special repository of these files complete with a phpunit test.

like image 662
Jeff Puckett Avatar asked Mar 11 '23 06:03

Jeff Puckett


1 Answers

While your solution works, and pretty efficiently given that it operates in O(n*k) time (where n is the number of nodes in the tree and k is the number of vertices), I figured I'd propose an alternative solution that doesn't rely on arrays or references and is more generalized to work, not just for XML, but for any DOM tree. This solution also operates in O(n*k) time, so it's just as efficient. The only difference is you can consume the values from a generator without having to build out the entire array first.

Modeling The Problem

The easiest way for me to understand this problem is to model it as a graph. If we model the document that way what we get are levels and vertices.

DOM tree figure1

So effectively, this allows us to divide and conquer, breaking down the problem into two distinct steps.

  1. Count the cardinal child node names of a given vertical as the sum (verticies)
  2. Find the max of the collective sum on the horizontal (levels)

Which means that if we do level-order traversal on this tree we should be able to easily produce the cardinality of node names as the maximum sum of all the verticals.

DOM tree figure2

In other words, there's a cardinality problem of getting the distinct child node names of each node. Then there's the issue of finding the maximum sum for that entire level.

Minimal, Complete, Verifiable, Self-Contained Example

So to provide a Minimal, Complete, Verifiable, and Self-Contained Example I'm going to rely on extending PHP's DOMDocument rather than the third party XML library you're using in your example.

It's probably worth noting that this code is not backwards compatible with PHP 5, (because of the use of yield from), so you must use PHP 7 for this implementation to work.

First, I'm going to implement a function in DOMDocument that allows us to iterate over the DOM tree in level-order by using a generator.

class SpecialDOM extends DOMDocument {
    public function level(DOMNode $node = null, $level = 0, $ignore = ["#text"]) {
        if (!$node) {
            $node = $this;
        }
        $stack = [];
        if ($node->hasChildNodes()) {
            foreach($node->childNodes as $child) {
                if (!in_array($child->nodeName, $ignore, true)) {
                    $stack[] = $child;
                }
            }
        }
        if ($stack) {
            yield $level => $stack;
            foreach($stack as $node) {
                yield from $this->level($node, $level + 1, $ignore);
            }
        }
    }
}

The mechanics of the function itself is actually quite simple. It doesn't rely on passing around arrays or using references, but instead uses the DOMDocument object itself, to build a stack of all child nodes in a given node. Then it can yield this entire stack at once. This is the level part. At which point we rely on recursion to yield from each element in this stack any other nodes on the next level.

Here's a very simple XML document to demonstrate how straight-forward this is.

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>

<Data>
    <Record>
        <SAMPLE>Some Sample</SAMPLE>
    </Record>
    <Note>
        <SAMPLE>Some Sample</SAMPLE>
    </Note>
    <Record>
        <SAMPLE>Sample 1</SAMPLE>
        <SAMPLE>Sample 2</SAMPLE>
    </Record>
</Data>
XML;

$dom = new SpecialDOM;
$dom->loadXML($xml);

foreach($dom->level() as $level => $stack) {
    echo "- Level $level\n";
    foreach($stack as $item => $node) {
        echo "$item => $node->nodeName\n";
    }
}

The output will look like this.

- Level 0
0 => Data
- Level 1
0 => Record
1 => Note
2 => Record
- Level 2
0 => SAMPLE
- Level 2
0 => SAMPLE
- Level 2
0 => SAMPLE
1 => SAMPLE

So at least now we have a way of knowing what level a node is on and in what order it appears on that level, which is useful for what we intend to do.

Now the idea of building a nested array is actually unnecessary to obtain the cardinality sought by max_count. Because we already have access to the nodes themselves from the DOM tree. Which means we know what elements are contained therein inside of our loop at each iteration. We don't have to generate the entire array at once to begin exploring it. We can do this at a level-order instead, which is actually really cool, because it means you can build a flat array to get to max_count for each record.

Let me demonstrate how that would work.

$max = [];
foreach($dom->level() as $level => $stack) {
    $sum = [];
    foreach($stack as $item => $node) {
        $name = $node->nodeName;
        // the sum
        if (!isset($sum[$name])) {
            $sum[$name] = 1;
        } else {
            $sum[$name]++;
        }
        // the maximum
        if (!isset($max[$level][$name])) {
            $max[$level][$name] = 1;
        } else {
            $max[$level][$name] = max($sum[$name], $max[$level][$name]);
        }
    }
}

var_dump($max);

The output we get would look like this.

array(3) {
  [0]=>
  array(1) {
    ["Data"]=>
    int(1)
  }
  [1]=>
  array(2) {
    ["Record"]=>
    int(2)
    ["Note"]=>
    int(1)
  }
  [2]=>
  array(1) {
    ["SAMPLE"]=>
    int(2)
  }
}

Which proves that we can calculate max_count without the need for references or complex nested arrays. It's also easier to wrap your head around when you obviate the one-way mapping semantics of PHP arrays.

Synopsis

Here's the resulting output from this code on your sample XML document.

array(5) {
  [0]=>
  array(1) {
    ["Data"]=>
    int(1)
  }
  [1]=>
  array(1) {
    ["Record"]=>
    int(2)
  }
  [2]=>
  array(1) {
    ["SAMPLE"]=>
    int(2)
  }
  [3]=>
  array(4) {
    ["TITLE"]=>
    int(1)
    ["SUBTITLE"]=>
    int(1)
    ["AUTH"]=>
    int(2)
    ["ABSTRACT"]=>
    int(1)
  }
  [4]=>
  array(2) {
    ["FNAME"]=>
    int(1)
    ["DISPLAY"]=>
    int(1)
  }
}

Which is identical to the max_count of each of your sub arrays.

  • Level 0
    • Data => max_count 1
  • Level 1
    • Record => max_count 2
  • Level 2
    • SAMPLE => max_count 2
  • Level 3
    • TITLE => max_count 1
    • SUBTITLE => max_count 1
    • AUTH => max_count 2
    • ABSTRACT => max_count 1
  • LEVEL 4
    • FNAME => max_count 1
    • DISPLAY => max_count 1

To get the elements for any of these nodes throughout the loop just look at $node->childNodes as you already have the tree (thus eliminating the need for references).

The only reason you needed to nest the elements into your array is because the keys of a PHP array have to be unique and since you're using the node name as the key this requires nesting to get the lower levels of the tree and still structure the value of max_count properly. So it's a data structure problem there and I solve it differently by avoiding modeling the solution after the data structure.

like image 71
Sherif Avatar answered Mar 19 '23 15:03

Sherif