
Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?

It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.

This being the case, I would have expected the following line to take an inordinate amount of time because, in order to determine whether 1 quadrillion is in the range, a quadrillion values would have to be generated:

1_000_000_000_000_000 in range(1_000_000_000_000_001) 

Furthermore: it seems that no matter how many zeroes I add on, the calculation more or less takes the same amount of time (basically instantaneous).

I have also tried things like this, but the calculation is still almost instant:

# count by tens
1_000_000_000_000_000_000_000 in range(0,1_000_000_000_000_000_000_001,10)

If I try to implement my own range function, the result is not so nice!

def my_crappy_range(N):
    i = 0
    while i < N:
        yield i
        i += 1
    return

What is the range() object doing under the hood that makes it so fast?


Martijn Pieters's answer was chosen for its completeness, but also see abarnert's first answer for a good discussion of what it means for range to be a full-fledged sequence in Python 3, and some information/warning regarding potential inconsistency for __contains__ function optimization across Python implementations. abarnert's other answer goes into some more detail and provides links for those interested in the history behind the optimization in Python 3 (and lack of optimization of xrange in Python 2). Answers by poke and by wim provide the relevant C source code and explanations for those who are interested.

asked May 06 '15 15:05 by Rick supports Monica


2 Answers

The Python 3 range() object doesn't produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.
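To see the near-constant cost for yourself, here is a rough timing sketch. It tests the last element of three ranges of very different sizes; the sizes and the repetition count are arbitrary choices, and the absolute numbers will vary by machine and Python build.

```python
import timeit

# Time a membership test for the last element of ranges of very
# different sizes. Absolute numbers depend on the machine.
for exponent in (3, 9, 18):
    size = 10 ** exponent
    big_range = range(size)
    needle = size - 1
    seconds = timeit.timeit(lambda: needle in big_range, number=100_000)
    print(f"range(10**{exponent}): {seconds / 100_000 * 1e9:.0f} ns per test")

# The answers themselves, independent of timing.
found = [(10 ** e - 1) in range(10 ** e) for e in (3, 9, 18)]
missing = [(10 ** e) in range(10 ** e) for e in (3, 9, 18)]
```

If the test scanned the values, the third line would take about a quadrillion times longer than the first; in practice the per-test times should be of the same order.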

From the range() object documentation:

The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).
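A quick way to check the memory claim, assuming CPython (sys.getsizeof reports only the object's own footprint, and other implementations may report sizes differently):

```python
import sys

small = range(10)
huge = range(10 ** 18)

# In CPython the range object itself is a fixed-size struct; getsizeof
# does not count the int objects it points to.
small_size = sys.getsizeof(small)
huge_size = sys.getsizeof(huge)
print(small_size, huge_size)

# A list stores a reference to every element, so it grows with its length.
list_size = sys.getsizeof(list(range(100_000)))
print(list_size)
```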

So at a minimum, your range() object would do:

class my_range:
    def __init__(self, start, stop=None, step=1, /):
        if stop is None:
            start, stop = 0, start
        self.start, self.stop, self.step = start, stop, step
        if step < 0:
            lo, hi, step = stop, start, -step
        else:
            lo, hi = start, stop
        self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

    def __iter__(self):
        current = self.start
        if self.step < 0:
            while current > self.stop:
                yield current
                current += self.step
        else:
            while current < self.stop:
                yield current
                current += self.step

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if 0 <= i < self.length:
            return self.start + i * self.step
        raise IndexError('my_range object index out of range')

    def __contains__(self, num):
        if self.step < 0:
            if not (self.stop < num <= self.start):
                return False
        else:
            if not (self.start <= num < self.stop):
                return False
        return (num - self.start) % self.step == 0

This is still missing several things that a real range() supports (such as the .index() or .count() methods, hashing, equality testing, or slicing), but should give you an idea.

I also simplified the __contains__ implementation to only focus on integer tests; if you give a real range() object a non-integer value (including subclasses of int), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.
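The fallback can be observed with a small experiment. CountingInt below is a hypothetical int subclass written only for this illustration; it counts how often it is compared for equality. This reflects CPython's behaviour and may differ on other implementations.

```python
class CountingInt(int):
    """Hypothetical int subclass that counts equality comparisons."""
    calls = 0

    def __eq__(self, other):
        CountingInt.calls += 1
        return int.__eq__(self, other)

    __hash__ = int.__hash__


r = range(1_000)

plain_result = 999 in r                   # exact int: arithmetic check
float_result = 999.0 in r                 # float: compared item by item
subclass_result = CountingInt(999) in r   # int subclass: also scanned
comparisons = CountingInt.calls

print(plain_result, float_result, subclass_result, comparisons)
```

All three tests give the same answer; the comparison count shows that the subclass instance was checked against many elements rather than resolved with a single calculation.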


* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this an O(log N) operation. Since it’s all executed in optimised C code and Python stores integer values in 30-bit chunks, you’d run out of memory before you saw any performance impact due to the size of the integers involved here.

answered Nov 05 '22 07:11 by Martijn Pieters


The fundamental misunderstanding here is in thinking that range is a generator. It's not. In fact, it's not any kind of iterator.

You can tell this pretty easily:

>>> a = range(5)
>>> print(list(a))
[0, 1, 2, 3, 4]
>>> print(list(a))
[0, 1, 2, 3, 4]

If it were a generator, iterating it once would exhaust it:

>>> b = my_crappy_range(5)
>>> print(list(b))
[0, 1, 2, 3, 4]
>>> print(list(b))
[]

What range actually is, is a sequence, just like a list. You can even test this:

>>> import collections.abc
>>> isinstance(a, collections.abc.Sequence)
True

This means it has to follow all the rules of being a sequence:

>>> a[3]         # indexable
3
>>> len(a)       # sized
5
>>> 3 in a       # membership
True
>>> reversed(a)  # reversible
<range_iterator at 0x101cd2360>
>>> a.index(3)   # implements 'index'
3
>>> a.count(3)   # implements 'count'
1

The difference between a range and a list is that a range is a lazy or dynamic sequence; it doesn't remember all of its values, it just remembers its start, stop, and step, and creates the values on demand on __getitem__.
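A short sketch of what "on demand" means in practice. The range below is far too large to ever store, yet indexing and slicing return right away because each result is computed from start, stop, and step:

```python
r = range(0, 10 ** 18, 7)     # far too many values to ever store

# Each lookup is computed as start + index * step.
item = r[10 ** 15]
last = r[-1]

# Slicing returns another range, again described only by start/stop/step.
sub = range(10)[2:8:2]

print(item, last, sub, list(sub))
```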

(As a side note, if you print(iter(a)) in CPython 3, you'll see a range_iterator rather than the list_iterator that list uses. The range_iterator doesn't read stored items; it computes each value from the start and step as it goes.)


Now, there's nothing that says that Sequence.__contains__ has to be constant time—in fact, for obvious examples of sequences like list, it isn't. But there's nothing that says it can't be. And it's easier to implement range.__contains__ to just check it mathematically ((val - start) % step, but with some extra complexity to deal with negative steps) than to actually generate and test all the values, so why shouldn't it do it the better way?
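The mathematical check, including the negative-step case, can be sketched as a standalone function. in_arithmetic_range is a hypothetical helper for illustration, not CPython's actual code, and it assumes integer arguments. The second half cross-checks it against the built-in range on a small grid of cases.

```python
def in_arithmetic_range(val, start, stop, step):
    """Membership test for an integer val without generating any values.

    Hypothetical helper for illustration; not CPython's actual code.
    """
    if step == 0:
        raise ValueError("step must not be zero")
    if step > 0:
        in_bounds = start <= val < stop
    else:
        in_bounds = stop < val <= start
    # Python's % works with negative steps: the remainder is zero
    # exactly when val - start is a multiple of step.
    return in_bounds and (val - start) % step == 0


# Cross-check against the built-in range on a small grid of cases.
mismatches = [
    (val, start, stop, step)
    for start in range(-5, 6)
    for stop in range(-5, 6)
    for step in (-3, -2, -1, 1, 2, 3)
    for val in range(-8, 9)
    if in_arithmetic_range(val, start, stop, step)
    != (val in range(start, stop, step))
]
print(len(mismatches))
```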

But there doesn't seem to be anything in the language that guarantees this will happen. As Ashwini Chaudhari points out, if you give it a non-integral value, instead of converting to integer and doing the mathematical test, it will fall back to iterating all the values and comparing them one by one. And just because CPython 3.2+ and PyPy 3.x versions happen to contain this optimization, and it's an obvious good idea and easy to do, there's no reason that IronPython or NewKickAssPython 3.x couldn't leave it out. (And in fact, CPython 3.0-3.1 didn't include it.)


If range actually were a generator, like my_crappy_range, then it wouldn't make sense to test __contains__ this way, or at least the way it makes sense wouldn't be obvious. If you'd already iterated the first 3 values, is 1 still in the generator? Should testing for 1 cause it to iterate and consume all the values up to 1 (or up to the first value >= 1)?
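Generators do in fact support the in operator, and the way they behave shows why the question is awkward. A small sketch, using a generator expression in place of my_crappy_range so that it runs on its own:

```python
gen = (i for i in range(5))

first_test = 2 in gen        # consumes 0, 1 and 2 while searching
remaining = list(gen)        # only what was not consumed is left

gen = (i for i in range(5))
next(gen); next(gen); next(gen)   # iterate past 0, 1, 2
late_test = 1 in gen         # 1 was already consumed; the search exhausts gen

seq = range(5)
seq_tests = (1 in seq, 1 in seq, list(seq))   # a range is unaffected
print(first_test, remaining, late_test, seq_tests)
```

Membership testing on a generator is destructive and depends on how much has already been consumed, whereas a range gives the same answer every time.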

answered Nov 05 '22 07:11 by abarnert