Working with suffix trees in python

Tags:

I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.

Here is my code for generating the string and the suffix tree.

import random

def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.

As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.

Many thanks

608

asked Apr 10 '12 09:04

doggysaywhat

2 Answers

This doesn't look like a tree to me. It looks like you are generating all possible suffixes, and storing them in a hashtable.

You will likely get much smaller memory performance if you use an actual tree. I suggest using a library implementation.

145

answered Sep 21 '22 18:09

Marcin

As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this

string[i+1:]

creates an actual (deep) copy of the substring starting at i+1.

If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:

def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string,i+1,N))
        else:
            tree[string[i]]=[buffer(string,i+1,N)]
    return tree

I tried this embedded in the rest of your code, and confirmed that it requires significantly less then 1 GB of main memory even at a total length of 8^11 characters.

Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.

answered Sep 20 '22 18:09

jogojapan

Related questions
                            
                                Is there a way to incorporate python code into moinmoin pages?
                            
                                how does list(string) split the string to an array of characters in python?
                            
                                pickling class method
                            
                                How to select parent based on the child in lxml?
                            
                                How does web2py query expressions work?
                            
                                Web Scraping data using python?
                            
                                converting a latex code to mathml or svg code in python
                            
                                Place of the import statement
                            
                                How to tail a file with gevent
                            
                                find command with exec in python subprocess gives error
                            
                                Render equation to .png file using Python
                            
                                Python Selenium (waiting for frame, element lookups)
                            
                                Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors
                            
                                How does Richardson–Lucy algorithm work? Code example?
                            
                                Finding minimum, maximum and average values for nested lists?
                            
                                re.findall regex hangs or very slow
                            
                                django: create object from constructor or model form
                            
                                Python / Pymongo variable for collection name in db insert command
                            
                                Post request with multipart/form-data in appengine python not working
                            
                                filtered and non-filtered elements in a single pass

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Working with suffix trees in python

Tags:

python

hash

suffix-tree

doggysaywhat

People also ask

2 Answers

Marcin

jogojapan

Recent Activity

Donate For Us