This is a follow-up to one of my previous posts.
I tried to understand why the RuleTransformer performance is so poor. Now I believe that it is so slow because its complexity is O(2n), where n is the height of the input XML tree.
Suppose I need to rename all labels of all elements to label "b":
import scala.xml._, scala.xml.transform._
val rule: RewriteRule = new RewriteRule() {
override def transform(node: Node): Seq[Node] = node match {
case e: Elem => e.copy(label = "b")
case other => other
}
}
def trans(node: Node): Node = new RuleTransformer(rule).apply(node)
Let's count how many times the transform
visits each node in input <a3><a2><a1/></a2></a3>
.
In order to count the visits we add a buffer visited
, init it in the beginning, store visited nodes, and print it in the end.
import scala.collection.mutable.ListBuffer
// buffer to store visited nodes
var visited: ListBuffer[Node] = ListBuffer[Node]()
val rule: RewriteRule = new RewriteRule() {
override def transform(n: Node): Seq[Node] = {
visited append (n) // count this visit
n match {
case e: Elem => e.copy(label = "b")
case other => other
}
}
}
def trans(node: Node): Node = {
visited = ListBuffer[Node]() // init the buffer
val r = new RuleTransformer(rule).apply(node)
// print visited nodes and numbers of visits
println(visited.groupBy(identity).mapValues(_.size).toSeq.sortBy(_._2))
r
}
Now let's run it in REPL and see the visited
scala> val a3 = <a3><a2><a1/></a2></a3>
a3: scala.xml.Elem = <a3><a2><a1/></a2></a3>
scala> trans(a3)
ArrayBuffer((<a3><b><b/></b></a3>,2), (<a2><b/></a2>,4), (<a1/>,8))
res1: scala.xml.Node = <b><b><b/></b></b>
So a1
is visited eight times.
If we transform <a4><a3><a2><a1/></a2></a3></a4>
then a1
will be visited 16 times, for <a5><a4><a3><a2><a1/></a2></a3></a4></a5>
-- 32, etc. So the complexity looks exponential.
Does it make sense ? How would you prove it by analysis of the code ?
This will be not a very rigours analysis, but the problem seems to be with the BasicTransformer's transform(Seq[Node]) method[1].
The child transform method will be called twice for a node which is changed. Specifically in your example all the nodes will be called twice for this reason. And you are right in that each node at height h will be called 2^(h-1) times. Also note that at most one child of a node will be called twice because use of span and code, in the specific example all the child of the nodes will be called twice.
Just to verify this wrote these code samples for modified RulesTransformer. ( I could have overriden RulesTransformer too. But anyway)
// This is same as library RuleTransformer added with print statement
class CopiedRuleTransformer(rules: RewriteRule*) extends BasicTransformer {
override def transform(n: Node): Seq[Node] = {
println(n)
rules.foldLeft(super.transform(n)) { (res, rule) => rule transform res }
}
}
My modified RuleTransformer
class MyRuleTransformer(rules: RewriteRule*) extends BasicTransformer {
override def transform(n: Node): Seq[Node] = {
println(n)
rules.foldLeft(super.transform(n)) { (res, rule) => rule transform res }
}
override def transform(ns:Seq[Node]): Seq[Node] = {
ns.flatMap(n => transform(n))
}
}
These codes are only for demonstrating purpose. You can call these as
new CopiedRuleTransformer(rule).apply(node)
or
new MyRuleTransformer(rule).apply(node)
[1] line : 34 https://github.com/scala/scala-xml/blob/master/src/main/scala/scala/xml/transform/BasicTransformer.scala
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With