Coming from the beautiful world of C, I am trying to understand this behavior:
In [1]: dataset = sqlContext.read.parquet('indir')
In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
In [3]: for item in sizes:
   ...:     if item == min(sizes):
   ...:         count = count + 1
   ...:
would not even finish after 20 minutes, even though I know that the list sizes
is not that big, less than 205k in length. However, this executed instantly:
In [8]: min_item = min(sizes)
In [9]: for item in sizes:
   ...:     if item == min_item:
   ...:         count = count + 1
   ...:
So what happened?
My guess: since Python is interpreted, it could not work out that min(sizes)
is always constant, and so could not replace the call with its return value after the first few iterations.
The reference for min() doesn't say anything that would explain the matter to me. I also wondered whether it might need to look at the partitions to compute the minimum, but that shouldn't be the case, since sizes
is a list
, not an RDD
!
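One way to test whether min() really runs on every pass (a small sketch of my own, with made-up data, not the real sizes list) is to wrap it in a counting function:

```python
# Hypothetical experiment: count how many times min() actually runs
# inside the loop, using a tiny list so it finishes instantly.
calls = 0

def counted_min(seq):
    """Wrapper around the built-in min() that counts invocations."""
    global calls
    calls += 1
    return min(seq)

sizes = [5, 3, 8, 3, 9]
count = 0
for item in sizes:
    if item == counted_min(sizes):  # re-evaluated on EVERY iteration
        count = count + 1

print(calls)  # 5 -- one call per element, not one call total
print(count)  # 2 -- two elements equal the minimum (the two 3s)
```

So min() is called once per element, which makes the whole loop quadratic in the length of the list.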
Edit:
Here is the source of my confusion, I wrote a similar program in C:
for (i = 0; i < SIZE; ++i)
    if (i == mymin(array, SIZE))
        ++count;
and got these timings:
C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 98.679177000 seconds wall clock time.
C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 0.000000000 seconds wall clock time.
and for timings, I used Nominal Animal's approach from my Time measurements.
I'm by no means an expert on the inner workings of Python, but from my understanding thus far you'd like to compare the speed of
for item in sizes:
    if item == min(sizes):
        count = count + 1
and
min_item = min(sizes)
for item in sizes:
    if item == min_item:
        count = count + 1
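To make the difference concrete, here is a rough timing sketch (the names n, slow, and fast are my own, and the list is synthetic, much smaller than the ~205k elements in the question) comparing the two loops with timeit:

```python
import timeit

n = 2_000
sizes = list(range(n, 0, -1))   # descending, so the minimum sits at the end

def slow():
    count = 0
    for item in sizes:
        if item == min(sizes):  # O(n) min inside an O(n) loop -> O(n^2)
            count += 1
    return count

def fast():
    count = 0
    min_item = min(sizes)       # O(n) min computed once -> O(n) total
    for item in sizes:
        if item == min_item:
            count += 1
    return count

assert slow() == fast() == 1    # both agree: the minimum (1) appears once
print("slow:", timeit.timeit(slow, number=1))
print("fast:", timeit.timeit(fast, number=1))
```

Even at only 2,000 elements, the quadratic version is orders of magnitude slower; at 205k elements the gap is roughly 100x larger again, which is why the original loop never finished.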
Now, someone correct me if I have any of this wrong, but:
In Python, lists are mutable and do not have a fixed length, and they are treated as such, while in C an array has a fixed size. From this question:
Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your array time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays.
Now take this example:
for item in sizes:
    if item == min(sizes):
        new_item = item - 1
        sizes.append(new_item)
Then the value of min(sizes)
would be different on the next iteration. Python doesn't cache the result of min(sizes)
, since caching would break the above example, or require extra logic to check whether the list has changed. Instead it leaves that up to you: by defining min_item = min(sizes)
you are essentially caching the result yourself.
Now, since the array has a fixed size in C, the compiler can find the min value with less overhead than a Python list; moreover, your -O3 timings suggest the optimizer hoisted the loop-invariant mymin call out of the loop entirely, which is why the optimized C build finishes in effectively zero time (C also being a much lower-level language).
Again, I don't fully understand the underlying implementation of Python, but I'm certain that if you analyzed the loop, you'd see Python repeatedly computing min(sizes)
, causing the extreme amount of lag. I'd love to learn more about the inner workings of Python (for example, are any method calls cached in a loop, or is everything recomputed on each iteration?), so if anyone has more info and/or corrections, let me know!
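As a partial answer to that last question: nothing is cached. A quick experiment of my own with the dis module (not from the original post) shows that the bytecode looks up and calls min inside the loop body, so CPython re-executes it on every pass and performs no loop-invariant hoisting:

```python
import dis

def loop(sizes):
    count = 0
    for item in sizes:
        if item == min(sizes):  # min is looked up and called here each time
            count = count + 1
    return count

# The disassembly shows a LOAD_GLOBAL (min) followed by a call
# instruction inside the FOR_ITER loop body, repeated every iteration.
dis.dis(loop)
```

The exact opcode names vary between CPython versions (e.g. CALL_FUNCTION vs CALL), but in all of them the lookup and call sit inside the loop.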