I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.
Here is my code for generating the string and the suffix tree.
import random
def get_string(length):
string=""
for i in range(length):
string += random.choice("ATGC")
return string
word=get_string(4**4)+"$"
def suffixtree(string):
for i in xrange(len(string)):
if tree.has_key(string[i]):
tree[string[i]].append([string[i+1:]][0])
else:
tree[string[i]]=[string[i+1:]]
return tree
tree={}
suffixtree(word)
When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.
As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.
Many thanks
A suffix tree is a tree data structure typically used to store a list of strings. It is also referred to as the compressed version of a trie, as, unlike a trie, each unique suffix in the list is compressed together and represented by a single node or branch in a suffix tree.
We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.
A suffix tree is a rooted, directed tree. It has n leaf nodes labeled from 1 to n, and its edges are labeled by the letters. On a path from the root to the leaf j, one can read the string's suffix S[j, n] and a $ sign.
A suffix tree made of a set of strings is known as Generalized Suffix Tree. We will discuss a simple way to build Generalized Suffix Tree here for two strings only. Later, we will discuss another approach to build Generalized Suffix Tree for two or more strings.
This doesn't look like a tree to me. It looks like you are generating all possible suffixes, and storing them in a hashtable.
You will likely get much smaller memory performance if you use an actual tree. I suggest using a library implementation.
As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this
string[i+1:]
creates an actual (deep) copy of the substring starting at i+1
.
If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:
def suffixtree(string):
N = len(string)
for i in xrange(N):
if tree.has_key(string[i]):
tree[string[i]].append(buffer(string,i+1,N))
else:
tree[string[i]]=[buffer(string,i+1,N)]
return tree
I tried this embedded in the rest of your code, and confirmed that it requires significantly less then 1 GB of main memory even at a total length of 8^11 characters.
Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer
type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With