
library for transforming a node tree

I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.

Example: I'd like to transform this node tree: (please excuse the S-expressions)

(A (B) (C) (D))

Into this one:

(C (B) (D))

This should apply whenever the parent is A and its second child is C, regardless of context (A itself may have any number of parents or ancestors). I'd like to express this transformation in a simple, concise, and reusable way. Of course this example is very specific. Please try to address the general case.

Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which I'd like to avoid. I'm still looking for more and/or better examples.


Background:

I'm able to convert Python and Cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I then plan to reorganize the tree and write out the results in order to implement automated refactoring. XSLT seems to be the standard tool for rewriting XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.

I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.

asked Jan 18 '12 21:01 by bukzor


2 Answers

Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.

def lift_middle_child(in_tree):
    # Destructure the expected shape: a root with exactly three one-element children.
    (A, (B,), (C,), (D,)) = in_tree

    return (C, (B,), (D,))

print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too

This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.

Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.

Here's the code in action: http://ideone.com/02Uv0i

Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:

from functools import reduce  # a builtin on Python 2; must be imported on Python 3

def compose2(a,b): # might want to get this from the functional library
    return lambda *x: a(b(*x))

def compose(*funcs): #compose(a,b,c) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2,funcs)

def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1] # slightly cryptic, but works for anything subscriptable

lift_and_reverse = compose(reverse_children,lift_middle_child) # right most function applied first - if you find this confusing, reverse order in compose function.

print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
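
The functions above assume that the whole input has the expected shape. One possible way to apply such a rule at any depth ("regardless of context") is a bottom-up traversal that tries the rule at every node and leaves non-matching nodes alone. The following is only a sketch: `rewrite_everywhere` and `lift_c_under_a` are hypothetical names introduced here, and trees are assumed to be nested tuples of the form (label, child, ...).

```python
def rewrite_everywhere(rule, tree):
    """Apply `rule` bottom-up to every node of a nested-tuple tree.

    `rule` takes a node and returns either a replacement node or None
    when it does not apply to that node.
    """
    label, children = tree[0], tree[1:]
    # Rewrite the children first, then try the rule on the rebuilt node.
    node = (label,) + tuple(rewrite_everywhere(rule, c) for c in children)
    result = rule(node)
    return node if result is None else result


def lift_c_under_a(node):
    """(A (B) (C) (D)) -> (C (B) (D)); None if the node has another shape."""
    if len(node) == 4 and node[0] == 'A':
        _, b, c, d = node
        if c == ('C',):
            return ('C', b, d)
    return None


nested = ('X', ('A', ('B',), ('C',), ('D',)), ('Y',))
print(rewrite_everywhere(lift_c_under_a, nested))
```

Because every rule has the same interface (node in, node or None out), rules can still be chained or composed. Note that this sketch makes a single pass and does not re-apply the rule to a node it has just produced.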
answered Nov 05 '22 18:11 by Marcin


What you really want, IMHO, is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source language (and even the target language), so that the rewrites are expressed directly.

You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect: trees representing real code are messier than you'd think, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume the tree has been transformed. How do you regenerate the source code syntax from which it came? You need some kind of prettyprinter.

A general problem, regardless of how the code is represented, is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform Python into a language that uses different operators for string concatenation and arithmetic (unlike Java, which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
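
To make the point about type information concrete, here is a small sketch using Python's standard `ast` module. It classifies a `+` expression as string concatenation or arithmetic only when the operand types are known, either from literals or from a caller-supplied table. The names `operand_type`, `plus_kind`, and `var_types` are hypothetical; in real code that table would have to come from the kind of flow analysis described above.

```python
import ast


def operand_type(node, var_types):
    """Best-effort static type of an operand, or None if it is unknown."""
    if isinstance(node, ast.Constant):
        return type(node.value)
    if isinstance(node, ast.Name):
        return var_types.get(node.id)
    return None


def plus_kind(source, var_types=None):
    """Classify a single `a + b` expression as 'concat', 'add' or 'unknown'."""
    var_types = var_types or {}
    expr = ast.parse(source, mode='eval').body
    if not (isinstance(expr, ast.BinOp) and isinstance(expr.op, ast.Add)):
        raise ValueError('expected an expression of the form a + b')
    left = operand_type(expr.left, var_types)
    right = operand_type(expr.right, var_types)
    if left is str and right is str:
        return 'concat'
    if left in (int, float) and right in (int, float):
        return 'add'
    return 'unknown'  # not enough type information to choose an operator


print(plus_kind('"a" + "b"'))
print(plus_kind('x + 1', {'x': int}))
print(plus_kind('x + y'))
```

The last call is the interesting one: with no information about `x` and `y`, a translator cannot tell which target operator to emit.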

Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).

To express your rewrite in DMS, on Python syntax close to your example (which isn't Python?), you would write:

  domain Python;

  rule revise_arguments(f:IDENTIFIER,A:expression,B:expression,
                                     C:expression,D:expression):primary->primary
  =  " \f(\A,(\B),(\C),(\D)) "
  -> " \f(\C,(\B),(\D)) ";

The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain declaration) from the DMS RSL language. A backslash-prefixed name inside the metaquotes (\f, \A, and so on) is a placeholder for the syntax variable of that name, whose nonterminal type is given in the rule parameter list. Yes, the ( ) inside the metaquotes are Python parentheses; they exist in the syntax trees as far as DMS is concerned because they, like the rest of the language, are just syntax.

The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression-language point of view, your example is odd precisely because it has unusual parentheses.

With this rule, DMS could parse Python (using its Python parser) like

        foobar(2+3,(x-y),(p),(baz()))

build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:

        foobar(p,(x-y),(baz()))

and then prettyprint the resulting (valid) Python surface syntax back out.
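
For comparison only (this is not DMS, and says nothing about how DMS works internally), the same parse, match, rewrite, prettyprint pipeline can be roughly approximated for this one rule with Python's standard `ast` module. The class name `ReviseArguments` is hypothetical, `ast.unparse` requires Python 3.9 or later, and the standard Python AST does not record redundant parentheses, so they are not reproduced in the output.

```python
import ast


class ReviseArguments(ast.NodeTransformer):
    """Rewrite f(A, B, C, D) into f(C, B, D) when f is a plain name."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite any nested calls first
        if (isinstance(node.func, ast.Name)
                and len(node.args) == 4 and not node.keywords):
            _a, b, c, d = node.args
            node.args = [c, b, d]
        return node


tree = ast.parse('foobar(2+3,(x-y),(p),(baz()))')   # parse
tree = ReviseArguments().visit(tree)                # match and rewrite
print(ast.unparse(tree))                            # prettyprint
```

Here the pattern is written as Python code that inspects tree nodes, rather than in the surface syntax of the language being transformed, which is exactly the difference the rule notation above is meant to remove.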

If you intended your example to be a transformation on LISP code, you'd need a LISP grammar for DMS (not hard to build, but we don't have much call for this), and would then write the rule in the corresponding surface syntax:

  domain Lisp;

  rule revise_form(A:form,B:form, C:form, D:form):form->form
  =  " (\A,(\B),(\C),(\D)) "
  -> " (\C,(\B),(\D)) ";

You can get a better feel for this by looking at Algebra as a DMS domain.

If your goal is to implement all this in Python... I don't have much help to offer. DMS is a pretty big system, and it would be a lot of effort to replicate.
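
For the narrow tree-shaped example in the question, though, a very small pattern-based rewriter can be sketched in Python. This is an illustration on nested tuples, not a substitute for a program transformation system: the names `match`, `substitute`, and `rewrite`, and the convention that strings starting with `?` are pattern variables, are all assumptions made for this sketch.

```python
def match(pattern, tree, bindings):
    """Match `tree` against `pattern`; return extended bindings or None."""
    if isinstance(pattern, str) and pattern.startswith('?'):
        if pattern in bindings:
            # A repeated variable must bind to an identical subtree.
            return bindings if bindings[pattern] == tree else None
        new_bindings = dict(bindings)
        new_bindings[pattern] = tree
        return new_bindings
    if isinstance(pattern, tuple):
        if not isinstance(tree, tuple) or len(tree) != len(pattern):
            return None
        for sub_pattern, sub_tree in zip(pattern, tree):
            bindings = match(sub_pattern, sub_tree, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == tree else None


def substitute(template, bindings):
    """Build a tree from `template`, replacing variables by their bindings."""
    if isinstance(template, str) and template.startswith('?'):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return template


def rewrite(rule, tree):
    """Apply a (pattern, template) rule bottom-up wherever the pattern matches."""
    if isinstance(tree, tuple):
        tree = tuple(rewrite(rule, child) for child in tree)
    pattern, template = rule
    bindings = match(pattern, tree, {})
    return tree if bindings is None else substitute(template, bindings)


# (A ?b (C) ?d) -> (C ?b ?d), wherever it occurs in the tree
rule = (('A', '?b', ('C',), '?d'), ('C', '?b', '?d'))
print(rewrite(rule, ('X', ('A', ('B',), ('C',), ('D',)), ('Y',))))
```

Because a repeated variable must bind to equal subtrees, a pattern such as `('eq', '?x', '?x')` expresses the "two subtrees are the same" condition directly. What the sketch lacks is everything else discussed above: a real parser, scope and type information, and a prettyprinter.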

answered Nov 05 '22 17:11 by Ira Baxter