Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Working with suffix trees in python

I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.

Here is my code for generating the string and the suffix tree.

import random

def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.

As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.

Many thanks

like image 608
doggysaywhat Avatar asked Apr 10 '12 09:04

doggysaywhat


People also ask

How does the suffix tree work?

A suffix tree is a tree data structure typically used to store a list of strings. It is also referred to as the compressed version of a trie, as, unlike a trie, each unique suffix in the list is compressed together and represented by a single node or branch in a suffix tree.

How do you make a suffix tree?

We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.

How do you read a suffix tree?

A suffix tree is a rooted, directed tree. It has n leaf nodes labeled from 1 to n, and its edges are labeled by the letters. On a path from the root to the leaf j, one can read the string's suffix S[j, n] and a $ sign.

Is it possible to create a suffix tree for a string?

A suffix tree made of a set of strings is known as Generalized Suffix Tree. We will discuss a simple way to build Generalized Suffix Tree here for two strings only. Later, we will discuss another approach to build Generalized Suffix Tree for two or more strings.


2 Answers

This doesn't look like a tree to me. It looks like you are generating all possible suffixes, and storing them in a hashtable.

You will likely get much smaller memory performance if you use an actual tree. I suggest using a library implementation.

like image 145
Marcin Avatar answered Sep 21 '22 18:09

Marcin


As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this

string[i+1:]

creates an actual (deep) copy of the substring starting at i+1.

If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:

def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string,i+1,N))
        else:
            tree[string[i]]=[buffer(string,i+1,N)]
    return tree

I tried this embedded in the rest of your code, and confirmed that it requires significantly less then 1 GB of main memory even at a total length of 8^11 characters.

Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.

like image 28
jogojapan Avatar answered Sep 20 '22 18:09

jogojapan