I have got a sequence of strings - 0000001, 0000002, 0000003....
upto 2 million. They are not contiguous. Meaning there are gaps. Say after 0000003 the next string might be 0000006. I need to find out all these gaps. In the above case (0000004, 0000005).
This is what I have done so far -
gaps = list()
total = len(curr_ids)
for i in range(total):
tmp_id = '%s' %(str(i).zfill(7))
if tmp_id in curr_ids:
continue
else:
gaps.append(tmp_id)
return gaps
But as you would have guessed, this is slow since I am using list
. If I use a dict
, to pre-populate curr_ids it'll be faster. But what's the complexity to populating a hash-table? What's the fastest way to do this.
You could sort the list of ids and then step through it once only:
def find_gaps(ids):
"""Generate the gaps in the list of ids."""
j = 1
for id_i in sorted(ids):
while True:
id_j = '%07d' % j
j += 1
if id_j >= id_i:
break
yield id_j
>>> list(find_gaps(["0000001", "0000003", "0000006"]))
['0000002', '0000004', '0000005']
If the input list is already in order, then you can avoid the sorted
(though it does little harm: Python's adaptive mergesort is O(n) if the list is already sorted).
For storing sequence of 2 millions ints you can use bitarray. Here each bit means one integer (the integer of that index in bitarray). Example code:
gaps = []
# bitarray is 0 based
a = bitarray.bitarray(total + 1)
a.setall(False)
for sid in curr_ids:
a[int(sid)] = True
for i in range(1, total):
if not a[i]:
gaps.append('%07d' %(i))
return gaps
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With