Longest equally-spaced subsequence

Tags: python, algorithm

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

1, 4, 5, 7, 8, 12

has a subsequence

   4,       8, 12

My naive method is greedy and just checks how far you can extend a subsequence from each point. This seems to take O(n²) time per point.
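For reference, here is a minimal sketch of what such a naive approach might look like in Python 3. It is not the asker's actual code: the function name is made up, and it assumes the values are sorted and distinct. It tries every pair as the first two elements and extends from there, so it is only useful for checking faster methods on small inputs.

```python
def longest_equally_spaced_naive(a):
    """Brute-force reference: try every pair as the first two elements."""
    values = set(a)          # constant-time membership test
    n = len(a)
    best = min(n, 2)         # any two elements form a valid subsequence
    for i in range(n):
        for j in range(i + 1, n):
            d = a[j] - a[i]
            if d == 0:
                continue     # assumption: distinct values, so skip duplicates
            length, nxt = 2, a[j] + d
            while nxt in values:   # extend as far as the step allows
                length += 1
                nxt += d
            best = max(best, length)
    return best


print(longest_equally_spaced_naive([1, 4, 5, 7, 8, 12]))  # 3
```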

Is there a faster way to solve this problem?

Update. I will test the code given in the answers as soon as possible (thank you). However, it is already clear that using O(n^2) memory will not work. So far no code terminates on the input [random.randint(0,100000) for r in xrange(200000)].

Timings. I tested with the following input data on my 32-bit system.

    import random

    a = [random.randint(0, 10000) for r in xrange(20000)]
    a.sort()

The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With PyPy it takes only 9 seconds! However, it crashes with a memory error on large inputs.
The O(nd) time method of Armin took 9 seconds with PyPy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a = [random.randint(0,100000) for r in xrange(200000)], but it didn't finish in the few minutes I gave it with PyPy.
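The 1.6G figure is consistent with a back-of-the-envelope estimate: an n x n list of lists holds n*n references, and on a 32-bit build each reference takes 4 bytes. A rough sketch of that arithmetic in Python 3 (the helper name is made up for illustration, and the estimate ignores per-list overhead):

```python
def table_bytes(n, pointer_size=4):
    """Approximate size of an n x n list of lists: n*n references."""
    return n * n * pointer_size


print(table_bytes(20000) / 1e9)      # 1.6  (GB, for the 20000-element test)
print(table_bytes(1000000) / 1e12)   # 4.0  (TB, for a million integers)
```

By this estimate a full table for a million integers would need on the order of terabytes, which is why quadratic memory is not an option here.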

To be able to test Kluev's method, I reran with

    a = [random.randint(0, 40000) for r in xrange(28000)]
    a = list(set(a))
    a.sort()

to make a list of length roughly 20000. All timings are with PyPy:

ZelluX, 9 seconds
Kluev, 20 seconds
Armin, 52 seconds

It seems that if the ZelluX method could be made linear space it would be the clear winner.

asked Aug 10 '13 07:08

graffe

2 Answers

We can get a solution that runs in O(n*m) time with very little memory by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two consecutive numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number a from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at a with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them for the same d. Because of this, the complexity is just O(n) for every value of d.

    A = [1, 4, 5, 7, 8, 12]    # in sorted order
    Aset = set(A)

    for d in range(1, 12):
        already_seen = set()
        for a in A:
            if a not in already_seen:
                b = a
                count = 1
                while b + d in Aset:
                    b += d
                    count += 1
                    already_seen.add(b)
                print "found %d items in %d .. %d" % (count, a, b)
                # collect here the largest 'count'
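The loop above only prints what it finds. As a minimal sketch in Python 3, the same scan could be wrapped in a function that keeps the largest count. The function name and the (count, first, step) return value are choices made here for illustration, not part of the answer, and only steps d >= 1 are tried, so runs of repeated values are not counted.

```python
def longest_run(A):
    """Return (count, first, step) for the longest run found in sorted A."""
    if not A:
        return (0, None, None)
    Aset = set(A)
    span = A[-1] - A[0]              # m in the text: highest minus lowest
    best = (1, A[0], 0)              # a single element is a run of length 1
    for d in range(1, span + 1):
        already_seen = set()
        for a in A:
            if a in already_seen:
                continue             # already part of a longer run for this d
            b, count = a, 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            if count > best[0]:
                best = (count, a, d)
    return best


print(longest_run([1, 4, 5, 7, 8, 12]))  # (3, 1, 3)
```

On the example from the question this reports 1, 4, 7, which has the same length as 4, 8, 12; ties are resolved in favour of the smaller step.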

Updates:

This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximate, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
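Both updates can be sketched together in Python 3. The early stop relies on a simple bound: a run of k + 1 items with step d spans d * k, so once d times the best count so far exceeds the range, no longer run can exist. The function name and the max_step parameter are hypothetical, introduced here only for illustration.

```python
def longest_run_bounded(A, max_step=None):
    """Return (count, first, step); optionally cap the step, stop early."""
    if not A:
        return (0, None, None)
    Aset = set(A)
    span = A[-1] - A[0]
    limit = span if max_step is None else min(max_step, span)
    best = (1, A[0], 0)
    for d in range(1, limit + 1):
        if d * best[0] > span:
            break                    # a run of best[0] + 1 items cannot fit
        already_seen = set()
        for a in A:
            if a in already_seen:
                continue
            b, count = a, 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            if count > best[0]:
                best = (count, a, d)
    return best


print(longest_run_bounded([1, 4, 5, 7, 8, 12]))              # (3, 1, 3)
print(longest_run_bounded([1, 4, 5, 7, 8, 12], max_step=2))  # (2, 4, 1)
```

With max_step set, the result is only the best among the steps tried, which is the approximation described in the first update.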

answered Oct 11 '22 02:10

Armin Rigo

UPDATE: I've found a paper on this problem, you can download it here.

Here is a solution based on dynamic programming. It takes O(n^2) time and O(n^2) space, and does not use hashing.

We assume all numbers are stored in array a in ascending order, and n holds its length. The 2D array l[i][j] holds the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).

    lmax = 2
    l = [[2 for i in xrange(n)] for j in xrange(n)]
    for mid in xrange(n - 1):
        prev = mid - 1
        succ = mid + 1
        while (prev >= 0 and succ < n):
            if a[prev] + a[succ] < a[mid] * 2:
                succ += 1
            elif a[prev] + a[succ] > a[mid] * 2:
                prev -= 1
            else:
                l[mid][succ] = l[prev][mid] + 1
                lmax = max(lmax, l[mid][succ])
                prev -= 1
                succ += 1

    print lmax
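The code above reports only the length. As a sketch in Python 3, the same two-pointer scan can also remember where the best run ends and with which step, which is enough to rebuild the subsequence itself. The function name is made up here, the input is assumed to be sorted with distinct values, and the table is still O(n^2), so this does nothing for the memory problem.

```python
def longest_equally_spaced_dp(a):
    """Two-pointer DP over sorted, distinct integers; returns the run."""
    n = len(a)
    if n <= 2:
        return list(a)               # zero, one or two items are trivially fine
    # l[i][j]: length of the longest run ending with a[i], a[j] (i < j)
    l = [[2] * n for _ in range(n)]
    lmax, end, step = 2, a[1], a[1] - a[0]
    for mid in range(1, n - 1):
        prev, succ = mid - 1, mid + 1
        while prev >= 0 and succ < n:
            total = a[prev] + a[succ]
            if total < 2 * a[mid]:
                succ += 1
            elif total > 2 * a[mid]:
                prev -= 1
            else:                    # a[mid] is the midpoint of prev and succ
                l[mid][succ] = l[prev][mid] + 1
                if l[mid][succ] > lmax:
                    lmax, end = l[mid][succ], a[succ]
                    step = a[succ] - a[mid]
                prev -= 1
                succ += 1
    # rebuild the run backwards from its last element
    return [end - step * k for k in reversed(range(lmax))]


print(longest_equally_spaced_dp([1, 4, 5, 7, 8, 12]))  # [1, 4, 7]
```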

answered Oct 11 '22 01:10

ZelluX
