
How to convert a data.frame to a tree structure object such as a dendrogram

I have a data.frame object. For a simple example:

> data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
  x  y   z
1 A Ab Abb
2 A Ac Acc
3 B Ba Bad
4 B Ba Bae
5 B Bd Bdd

There are many more rows and columns in the actual data. How can I create a nested tree structure object, such as a dendrogram, like this:

         |---Ab---Abb
     A---|
     |   |---Ac---Acc
   --|                 /--Bad 
     |   |---Ba-------|
     B---|             \--Bae
         |---Bd---Bdd

RNA asked Mar 11 '13 16:03


2 Answers

data.frame to Newick

I did my PhD in computational phylogenetics, and somewhere along the way I produced this code, which I used once or twice when I got data in this format (nonstandard in the phylogenetic sense). The script traverses the dataframe as if it were a tree and pastes labels along the way into a Newick string, which is a standard format and can then be converted into any kind of tree object.

I guess the script could be optimized (I used it so rarely that more work on it would have reduced my overall efficiency), but it is better to share it than to let it collect dust lying around on my hard drive.

    ## recursion function: returns the Newick substring for node `a` in column `i`
    traverse <- function(a,i,innerl){
        if(i < ncol(df)){
            ## distinct children of `a` in the next column
            alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
            if(length(alevelinner) == 1) {
                ## a single child: skip this inner node and descend
                newickout <- traverse(alevelinner,i+1,innerl)
            } else {
                desc <- NULL
                for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                il <- NULL; if(innerl) il <- a
                newickout <- paste("(",paste(desc,collapse=","),")",il,sep="")
            }
        } else {
            ## last column: `a` is a leaf
            newickout <- a
        }
        newickout
    }

    ## data.frame to newick function
    df2newick <- function(df, innerlabel=FALSE){
        ## let traverse() see this function's `df` rather than a global variable
        environment(traverse) <- environment()
        alevel <- as.character(unique(df[,1]))
        newick <- NULL
        for(x in alevel) newick <- c(newick,traverse(x,1,innerlabel))
        paste("(",paste(newick,collapse=","),");",sep="")
    }

The main function df2newick() takes two arguments:

  • df which is the dataframe to be transformed (object of class data.frame)
  • innerlabel, which tells the function whether to write labels for inner nodes (logical)

To demonstrate it on your example:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    myNewick <- df2newick(df)
    #[1] "((Abb,Acc),((Bad,Bae),Bdd));"

Now you can read it into an object of class phylo with read.tree() from ape:

    library(ape)
    mytree <- read.tree(text=myNewick)
    plot(mytree)

If you want to add inner node labels to the Newick string, you can use this:

    myNewick <- df2newick(df, TRUE)
    #[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"

Hope this is useful (and maybe my PhD wasn't a complete waste of time ;-)


Additional note for your dataframe format:

As you can see, df2newick() ignores inner nodes with only one child (which most phylogenetic methods expect anyway, and it was the only case relevant to me). The df objects that I originally received and used with this script had this format:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))

This is very similar to yours, but there the single-child inner nodes simply had the same name as their children. In your data these nodes have their own names, and those names get ignored. That might not matter, but if it does, you can disable part of the recursion function, like this:

    traverse <- function(a,i,innerl){
        if(i < (ncol(df))){
            alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
            desc <- NULL
            ##if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
            ##else {
                for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                il <- NULL; if(innerl==TRUE) il <- a
                (newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
            ##}
        }
        else { (newickout <- a) }
    }

and with df2newick(df, TRUE) you would get something like this:

    [1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"

This looks really odd to me, but I add it just in case, because it now includes all the information from your original dataframe.
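One caveat with the label-matching approach above: traverse() looks up children by label within a single column, so the same label appearing under two different parents would be merged. A minimal sketch of a variant that recurses on row subsets instead is shown below. The name df2newick_split() and its keep_single argument are hypothetical and not part of the original script; it is meant as an illustration under those assumptions, not a replacement.

```r
## Hypothetical helper (not from the original answer): builds the Newick
## string by recursing on row subsets instead of matching labels per column.
df2newick_split <- function(df, innerlabel = FALSE, keep_single = FALSE) {
  df[] <- lapply(df, as.character)            # work with plain strings

  build <- function(sub, i) {
    # last column: every distinct value is a leaf
    if (i == ncol(sub)) return(unique(sub[[i]]))
    keys <- unique(sub[[i]])
    vapply(keys, function(k) {
      # Newick pieces of all children below node k
      kids <- build(sub[sub[[i]] == k, , drop = FALSE], i + 1)
      # collapse inner nodes with a single child unless asked to keep them
      if (length(kids) == 1 && !keep_single) return(kids)
      paste0("(", paste(kids, collapse = ","), ")", if (innerlabel) k else "")
    }, character(1), USE.NAMES = FALSE)
  }

  paste0("(", paste(build(df, 1), collapse = ","), ");")
}

df <- data.frame(x = c('A','A','B','B','B'),
                 y = c('Ab','Ac','Ba','Ba','Bd'),
                 z = c('Abb','Acc','Bad','Bae','Bdd'))

df2newick_split(df)
# "((Abb,Acc),((Bad,Bae),Bdd));"
df2newick_split(df, innerlabel = TRUE, keep_single = TRUE)
# "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
```

On the example data this reproduces both strings shown above; the only intended difference is that children are found by subsetting rows, so identical labels in different branches stay separate.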

Martin Turjak answered Nov 11 '22 14:11


I don't know much about the internal structure of dendrograms in R, but the following code will create a nested list structure with the hierarchy that I think you are looking for:

    stree = function(x,level=0) {
        # x is a string vector
        # result is a hierarchical structure of lists (that contain lists, etc.)
        # the names of the lists are the node values.

        level = level+1
        if (length(x)==1) {
            # a single string left: store its remaining characters as one node
            result = list()
            result[[substring(x[1],level)]]=list()
            return(result)
        }
        result=list()
        # group the strings by their character at the current position
        this.level = substring(x,level,level)
        next.levels = unique(this.level)
        for (p in next.levels) {
            if (p=="") {
                # a string ends here; note that `$p` uses the literal name "p"
                result$p = list()
            } else {
                ids = which(this.level==p)
                result[[p]] = stree(x[ids],level)
            }
        }
        result
    }

It operates on a vector of strings and splits them character by character, so for your dataframe you would call stree(as.character(df[,3])).
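As for dendrograms themselves: in base R (package stats) a dendrogram is also a nested list, where each node carries "members", "height" and "label" attributes and leaves are marked with "leaf" = TRUE. The sketch below builds such an object from the columns of the dataframe rather than from characters. The function name df2dendrogram(), the "root" label and the one-unit-per-column heights are assumptions made for illustration; it does not set plotting attributes such as "midpoint".

```r
## Hypothetical helper (not from the answers above): turns the columns of a
## data frame into a nested "dendrogram" object using only base R.
df2dendrogram <- function(df) {
  df[] <- lapply(df, as.character)
  n_leaf <- 0L                                 # running leaf counter

  build <- function(sub, i, label) {
    if (i > ncol(sub)) {
      # past the last column: this node is a leaf
      n_leaf <<- n_leaf + 1L
      node <- n_leaf
      attr(node, "members") <- 1L
      attr(node, "height") <- 0
      attr(node, "leaf") <- TRUE
    } else {
      # one child per distinct value in column i, in order of appearance
      keys <- unique(sub[[i]])
      node <- lapply(keys, function(k)
        build(sub[sub[[i]] == k, , drop = FALSE], i + 1, k))
      attr(node, "members") <- sum(vapply(node, attr, integer(1), "members"))
      attr(node, "height") <- ncol(sub) - i + 1  # assumed: one unit per level
    }
    attr(node, "label") <- label
    class(node) <- "dendrogram"
    node
  }

  build(df, 1, "root")                         # "root" is an arbitrary label
}

df <- data.frame(x = c('A','A','B','B','B'),
                 y = c('Ab','Ac','Ba','Ba','Bd'),
                 z = c('Abb','Acc','Bad','Bae','Bdd'))
dend <- df2dendrogram(df)
attr(dend, "members")
# 5
```

Unlike the Newick approach, this keeps single-child nodes such as Ab, so every column contributes one level to the tree. str(dend) is a convenient way to inspect the result.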

Hope this helps.

amit answered Nov 11 '22 15:11