I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example
1, 4, 5, 7, 8, 12
has a subsequence
4, 8, 12
My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²)
time per point it seems.
Is there a faster way to solve this problem?
Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)]
.
Timings. I tested with the following input data on my 32 bit system.
a= [random.randint(0,10000) for r in xrange(20000)] a.sort()
In order to be able to test the method of Kluev's I reran with
a= [random.randint(0,40000) for r in xrange(28000)] a = list(set(a)) a.sort()
to make a list of length roughly 20000
. All timings with pypy
It seems that if the ZelluX method could be made linear space it would be the clear winner.
The idea is to use Hashing. We first insert all elements in a Set. Then check all the possible starts of consecutive subsequences.
To find the LIS for a given array, we need to return max(L(i)) where 0 < i < n. Formally, the length of the longest increasing subsequence ending at index i, will be 1 greater than the maximum of lengths of all longest increasing subsequences ending at indices before i, where arr[j] < arr[i] (j < i).
We can have a solution O(n*m)
in time with very little memory needs, by adapting yours. Here n
is the number of items in the given input sequence of numbers, and m
is the range, i.e. the highest number minus the lowest one.
Call A the sequence of all input numbers (and use a precomputed set()
to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n)
for every value of d.
A = [1, 4, 5, 7, 8, 12] # in sorted order Aset = set(A) for d in range(1, 12): already_seen = set() for a in A: if a not in already_seen: b = a count = 1 while b + d in Aset: b += d count += 1 already_seen.add(b) print "found %d items in %d .. %d" % (count, a, b) # collect here the largest 'count'
Updates:
This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000
would be good enough. Then the complexity goes down to O(n*1000)
. This makes the algorithm approximative, but actually runnable for n=1000000
. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
UPDATE: I've found a paper on this problem, you can download it here.
Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.
We assume all numbers are saved in array a
in ascending order, and n
saves its length. 2D array l[i][j]
defines length of longest equally-spaced subsequence ending with a[i]
and a[j]
, and l[j][k]
= l[i][j]
+ 1 if a[j]
- a[i]
= a[k]
- a[j]
(i < j < k).
lmax = 2 l = [[2 for i in xrange(n)] for j in xrange(n)] for mid in xrange(n - 1): prev = mid - 1 succ = mid + 1 while (prev >= 0 and succ < n): if a[prev] + a[succ] < a[mid] * 2: succ += 1 elif a[prev] + a[succ] > a[mid] * 2: prev -= 1 else: l[mid][succ] = l[prev][mid] + 1 lmax = max(lmax, l[mid][succ]) prev -= 1 succ += 1 print lmax
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With