Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest equally-spaced subsequence

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

1, 4, 5, 7, 8, 12 

has a subsequence

   4,       8, 12 

My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²) time per point it seems.

Is there a faster way to solve this problem?

Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .

Timings. I tested with the following input data on my 32 bit system.

a= [random.randint(0,10000) for r in xrange(20000)]  a.sort() 
  • The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
  • The O(nd) time method of Armin took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.

In order to be able to test the method of Kluev's I reran with

a= [random.randint(0,40000) for r in xrange(28000)]  a = list(set(a)) a.sort() 

to make a list of length roughly 20000. All timings with pypy

  • ZelluX, 9 seconds
  • Kluev, 20 seconds
  • Armin, 52 seconds

It seems that if the ZelluX method could be made linear space it would be the clear winner.

like image 704
graffe Avatar asked Aug 10 '13 07:08

graffe


People also ask

Which algorithm is used to find longest consecutive subsequence?

The idea is to use Hashing. We first insert all elements in a Set. Then check all the possible starts of consecutive subsequences.

How do you find the longest subsequence in an array?

To find the LIS for a given array, we need to return max(L(i)) where 0 < i < n. Formally, the length of the longest increasing subsequence ending at index i, will be 1 greater than the maximum of lengths of all longest increasing subsequences ending at indices before i, where arr[j] < arr[i] (j < i).


2 Answers

We can have a solution O(n*m) in time with very little memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.

A = [1, 4, 5, 7, 8, 12]    # in sorted order Aset = set(A)  for d in range(1, 12):     already_seen = set()     for a in A:         if a not in already_seen:             b = a             count = 1             while b + d in Aset:                 b += d                 count += 1                 already_seen.add(b)             print "found %d items in %d .. %d" % (count, a, b)             # collect here the largest 'count' 

Updates:

  • This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)

  • If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.

like image 77
Armin Rigo Avatar answered Oct 11 '22 02:10

Armin Rigo


UPDATE: I've found a paper on this problem, you can download it here.

Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.

We assume all numbers are saved in array a in ascending order, and n saves its length. 2D array l[i][j] defines length of longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).

lmax = 2 l = [[2 for i in xrange(n)] for j in xrange(n)] for mid in xrange(n - 1):     prev = mid - 1     succ = mid + 1     while (prev >= 0 and succ < n):         if a[prev] + a[succ] < a[mid] * 2:             succ += 1         elif a[prev] + a[succ] > a[mid] * 2:             prev -= 1         else:             l[mid][succ] = l[prev][mid] + 1             lmax = max(lmax, l[mid][succ])             prev -= 1             succ += 1  print lmax 
like image 35
ZelluX Avatar answered Oct 11 '22 01:10

ZelluX