Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can numpy be so much faster than my Fortran routine?

I get a 512^3 array representing a Temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum and mean of this array and as I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very easy routine.

  integer gridsize,unit,j   real mini,maxi   double precision mean    gridsize=512   unit=40   open(unit=unit,file='T.out',status='old',access='stream',&        form='unformatted',action='read')   read(unit=unit) tmp   mini=tmp   maxi=tmp   mean=tmp   do j=2,gridsize**3       read(unit=unit) tmp       if(tmp>maxi)then           maxi=tmp       elseif(tmp<mini)then           mini=tmp       end if       mean=mean+tmp   end do   mean=mean/gridsize**3   close(unit=unit) 

This takes about 25 seconds per file on the machine I use. That struck me as being rather long and so I went ahead and did the following in Python:

    import numpy      mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\                                   shape=(512,512,512),order='F')     mini=numpy.amin(mmap)     maxi=numpy.amax(mmap)     mean=numpy.mean(mmap) 

Now, I expected this to be faster of course, but I was really blown away. It takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more) but only on the 7th significant digit or so.

How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?

EDIT:

To answer the questions in the comments:

  • Yes, also I ran the Fortran routine with 32-bit and 64-bit floats but it had no impact on performance.
  • I used iso_fortran_env which provides 128-bit floats.
  • Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
  • I ran both routines on different files in different order, so the caching should have been fair in the comparison I guess ?
  • I actually tried open MP, but to read from the file at different positions at the same time. Having read your comments and answers this sounds really stupid now and it made the routine take a lot longer as well. I might give it a try on the array operations but maybe that won't even be necessary.
  • The files are actually 1/2G in size, that was a typo, Thanks.
  • I will try the array implementation now.

EDIT 2:

I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy but now I have a precision problem as @Luaan pointed out I might get. Using a 32-bit float array the mean computed by sum is 20% off. Doing

... real,allocatable :: tmp (:,:,:) double precision,allocatable :: tmp2(:,:,:) ... tmp2=tmp mean=sum(tmp2)/size(tmp) ... 

Solves the issue but increases computing time (not by very much, but noticeably). Is there a better way to get around this issue? I couldn't find a way to read singles from the file directly to doubles. And how does numpy avoid this?

Thanks for all the help so far.

like image 734
user35915 Avatar asked Nov 15 '15 19:11

user35915


People also ask

Why NumPy is faster than for loop?

NumPy Arrays are faster than Python Lists because of the following reasons: An array is a collection of homogeneous data-types that are stored in contiguous memory locations. On the other hand, a list in Python is a collection of heterogeneous data types stored in non-contiguous memory locations.

Why NumPy is faster than C?

Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code.

Does NumPy use Fortran?

Various NumPy modules use FORTRAN 77 libraries, so you'll also need a FORTRAN 77 compiler installed. Note that NumPy is developed mainly using GNU compilers.

How fast is NumPy Where?

Because the Numpy array is densely packed in memory due to its homogeneous type, it also frees the memory faster. So overall a task executed in Numpy is around 5 to 100 times faster than the standard python list, which is a significant leap in terms of speed.


1 Answers

Your Fortran implementation suffers two major shortcomings:

  • You mix IO and computations (and read from the file entry by entry).
  • You don't use vector/matrix operations.

This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:

program test   integer gridsize,unit   real mini,maxi,mean   real, allocatable :: tmp (:,:,:)    gridsize=512   unit=40    allocate( tmp(gridsize, gridsize, gridsize))    open(unit=unit,file='T.out',status='old',access='stream',&        form='unformatted',action='read')   read(unit=unit) tmp    close(unit=unit)    mini = minval(tmp)   maxi = maxval(tmp)   mean = sum(tmp)/gridsize**3   print *, mini, maxi, mean  end program 

The idea is to read in the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.


For the accuracy issue: Simply using double precision values and doing the conversion on the fly as

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0)) 

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.

At -O3, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).


Here is a very fast implementation using LAPACK:

program test   integer gridsize,unit, i, j   real mini,maxi   integer  :: t1, t2, rate   real, allocatable :: tmp (:,:,:)   real, allocatable :: work(:) !  double precision :: mean   real :: mean   real :: slange    call system_clock(count_rate=rate)   call system_clock(t1)   gridsize=512   unit=40    allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))    open(unit=unit,file='T.out',status='old',access='stream',&        form='unformatted',action='read')   read(unit=unit) tmp    close(unit=unit)    mini = minval(tmp)   maxi = maxval(tmp)  !  mean = sum(tmp)/gridsize**3 !  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))   mean = 0.d0   do j=1,gridsize     do i=1,gridsize       mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)     enddo !i   enddo !j   mean = mean / gridsize**3    print *, mini, maxi, mean   call system_clock(t2)   print *,real(t2-t1)/real(rate)  end program 

This uses the single precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single precision array functions - and does not show the precision issue.

like image 94
Alexander Vogt Avatar answered Sep 25 '22 03:09

Alexander Vogt