Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

strcmp for python or how to sort substrings efficiently (without copy) when building a suffix array

Here's a very simple way to build an suffix array from a string in python:

def sort_offsets(a, b):
    return cmp(content[a:], content[b:])

content = "foobar baz foo"
suffix_array.sort(cmp=sort_offsets)
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

However, "content[a:]" makes a copy of content, which becomes very inefficient when content gets large. So i wonder if there's a way to compare the two substrings without having to copy them. I've tried to use the buffer-builtin, but it didn't worked.

like image 650
ephes Avatar asked Feb 17 '10 16:02

ephes


2 Answers

The buffer function does not copy the whole string, but creates an object that only references the source string. Using interjay's suggestion, that would be:

suffix_array.sort(key=lambda a: buffer(content, a))
like image 124
AndiDog Avatar answered Sep 30 '22 14:09

AndiDog


I don't know if there's a fast way to compare substrings, but you can make your code much faster (and simpler) by using key instead of cmp:

suffix_array.sort(key=lambda a: content[a:])

This will create the substring just once for each value of a.

Edit: A possible downside is that it will require O(n^2) memory for the substrings.

like image 26
interjay Avatar answered Sep 30 '22 14:09

interjay