Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python double free error for huge datasets

I have a very simple script in Python, but for some reason I get the following error when running a large amount of data:

*** glibc detected *** python: double free or corruption (out): 0x00002af5a00cc010 ***

I am used to these errors coming up in C or C++, when one tries to free memory that has already been freed. However, by my understanding of Python (and especially the way I've written the code), I really don't understand why this should happen.

Here is the code:

#!/usr/bin/python -tt                                                                                                                                                                                                                         

import sys, commands, string
import numpy as np
import scipy.io as io
from time import clock

W = io.loadmat(sys.argv[1])['W']
size = W.shape[0]
numlabels = int(sys.argv[2])
Q = np.zeros((size, numlabels), dtype=np.double)
P = np.zeros((size, numlabels), dtype=np.double)
Q += 1.0 / Q.shape[1]
nu = 0.001
mu = 0.01
start = clock()
mat = -nu + mu*(W*(np.log(Q)-1))
end = clock()
print >> sys.stderr, "Time taken to compute matrix: %.2f seconds"%(end-start)

One may ask, why declare a P and a Q numpy array? I simply do that to reflect the actual conditions (as this code is simply a segment of what I actually do, where I need a P matrix and declare it beforehand).

I have access to a 192GB machine, and so I tested this out on a very large SciPy sparse matrix (2.2 million by 2.2 million, but very sparse, that's not the issue). The main memory is taken up by the Q, P, and mat matrices, as they are all 2.2 million by 2000 matrices (size = 2.2 million, numlabels = 2000). The peak memory goes up to 131GB, which comfortably fits in memory. While the mat matrix is being computed, I get the glibc error, and my process automatically goes into the sleep (S) state, without deallocating the 131GB it has taken up.

Given the bizarre (for Python) error (I am not explicitly deallocating anything), and the fact that this works nicely for smaller matrix sizes (around 1.5 million by 2000), I am really not sure where to start to debug this.

As a starting point, I have set "ulimit -s unlimited" before running, but to no avail.

Any help or insight into numpy's behavior with really large amounts of data would be welcome.

Note that this is NOT an out of memory error - I have 196GB, and my process reaches around 131GB and stays there for some time before giving the error below.

Update: February 16, 2013 (1:10 PM PST):

As per suggestions, I ran Python with GDB. Interestingly, on one GDB run I forgot to set the stack size limit to "unlimited", and got the following output:

*** glibc detected *** /usr/bin/python: munmap_chunk(): invalid pointer: 0x00007fe7508a9010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733b6)[0x7ffff6ec23b6]
/usr/lib64/python2.7/site-packages/numpy/core/multiarray.so(+0x4a496)[0x7ffff69fc496]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4e67)[0x7ffff7af48c7]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x309)[0x7ffff7af6c49]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7ffff7b25592]
/usr/lib64/libpython2.7.so.1.0(+0xfcc61)[0x7ffff7b33c61]
/usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x84)[0x7ffff7b34074]
/usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x189)[0x7ffff7b347c9]
/usr/lib64/libpython2.7.so.1.0(Py_Main+0x36c)[0x7ffff7b3e1bc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffff6e6dbfd]
/usr/bin/python[0x4006e9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 09:01 50336181                           /usr/bin/python2.7
00600000-00601000 r--p 00000000 09:01 50336181                           /usr/bin/python2.7
00601000-00602000 rw-p 00001000 09:01 50336181                           /usr/bin/python2.7
00602000-00e5f000 rw-p 00000000 00:00 0                                  [heap]
7fdf2584c000-7ffff0a66000 rw-p 00000000 00:00 0 
7ffff0a66000-7ffff0a6b000 r-xp 00000000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0a6b000-7ffff0c6a000 ---p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6a000-7ffff0c6b000 r--p 00004000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6b000-7ffff0c6c000 rw-p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6c000-7ffff0c77000 r-xp 00000000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0c77000-7ffff0e76000 ---p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e76000-7ffff0e77000 r--p 0000a000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e77000-7ffff0e78000 rw-p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e78000-7ffff0e79000 rw-p 00000000 00:00 0 
7ffff0e79000-7ffff0e9b000 r-xp 00000000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff0e9b000-7ffff109a000 ---p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109a000-7ffff109b000 r--p 00021000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109b000-7ffff109f000 rw-p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109f000-7ffff10a0000 rw-p 00000000 00:00 0 
7ffff10a0000-7ffff10a5000 r-xp 00000000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff10a5000-7ffff12a4000 ---p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a4000-7ffff12a5000 r--p 00004000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a5000-7ffff12a7000 rw-p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a7000-7ffff12ad000 r-xp 00000000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff12ad000-7ffff14ac000 ---p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ac000-7ffff14ad000 r--p 00005000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ad000-7ffff14ae000 rw-p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ae000-7ffff14b5000 r-xp 00000000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff14b5000-7ffff16b4000 ---p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b4000-7ffff16b5000 r--p 00006000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b5000-7ffff16b6000 rw-p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b6000-7ffff17c2000 r-xp 00000000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff17c2000-7ffff19c2000 ---p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c2000-7ffff19c3000 r--p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c3000-7ffff19c6000 rw-p 0010d000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c6000-7ffff19d5000 r-xp 00000000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff19d5000-7ffff1bd4000 ---p 0000f000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff1bd4000-7ffff1bd5000 r--p 0000e000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
Program received signal SIGABRT, Aborted.
0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
#1  0x00007ffff6e82fb6 in abort () from /lib64/libc.so.6
#2  0x00007ffff6ebcdd3 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ec23b6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69fc496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#5  0x00007ffff7af48c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#6  0x00007ffff7af6c49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#7  0x00007ffff7b25592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#8  0x00007ffff7b33c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#9  0x00007ffff7b34074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#10 0x00007ffff7b347c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007ffff7b3e1bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007ffff6e6dbfd in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004006e9 in _start ()

When I set the stack size limit to unlimited", I get the following:

*** glibc detected *** /usr/bin/python: double free or corruption (out): 0x00002abb2732c010 ***
^X^C
Program received signal SIGINT, Interrupt.
0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab969f2e in _L_lock_9927 () from /lib64/libc.so.6
#2  0x00002aaaab9682d1 in free () from /lib64/libc.so.6
#3  0x00002aaaaaabbfe2 in _dl_scope_free () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaab70a4 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#5  0x00002aaaaaabcaa0 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00002aaaaaabc5da in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00002aaaab9fb530 in do_dlopen () from /lib64/libc.so.6
#9  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00002aaaab9fb5cf in dlerror_run () from /lib64/libc.so.6
#11 0x00002aaaab9fb637 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00002aaaab9d60c5 in init () from /lib64/libc.so.6
#13 0x00002aaaab080933 in pthread_once () from /lib64/libpthread.so.0
#14 0x00002aaaab9d61bc in backtrace () from /lib64/libc.so.6
#15 0x00002aaaab95dde7 in __libc_message () from /lib64/libc.so.6
#16 0x00002aaaab9633b6 in malloc_printerr () from /lib64/libc.so.6
#17 0x00002aaaab9682dc in free () from /lib64/libc.so.6
#18 0x00002aaaabef1496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#19 0x00002aaaaad888c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#20 0x00002aaaaad8ac49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#21 0x00002aaaaadb9592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#22 0x00002aaaaadc7c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#23 0x00002aaaaadc8074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#24 0x00002aaaaadc87c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#25 0x00002aaaaadd21bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#26 0x00002aaaab90ebfd in __libc_start_main () from /lib64/libc.so.6
#27 0x00000000004006e9 in _start ()

This makes me believe the basic issue is with the numpy multiarray core module (line #4 in the first output and line #18 in the second). I will bring it up as a bug report in both numpy and scipy just in case.

Has anyone seen this before?

Update: February 17, 2013 (4:45 PM PST)

I found a machine that I could run the code on that had a more recent version of SciPy (0.11) and NumPy (1.7.0). Running the code straight up (without GDB) resulted in a seg fault without any output to stdout or stderr. Running again through GDB, I get the following:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaac5fcd04 in PyDataMem_FREE (ptr=<optimized out>, $K8=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:3510
#2  array_dealloc (self=0xc00ab7edbfc228fe) at numpy/core/src/multiarray/arrayobject.c:416
#3  0x0000000000498eac in PyEval_EvalFrameEx ()
#4  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#5  0x00000000004a9081 in PyRun_FileExFlags ()
#6  0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#7  0x00000000004aa8bd in Py_Main ()
#8  0x00002aaaabe4f76d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041b9b1 in _start ()

I understand this is not as useful as a NumPy compiled with debugging symbols, I will try doing that and post the output later.

like image 523
Avneesh Avatar asked Feb 16 '13 04:02

Avneesh


2 Answers

After discussions on the same issue on the Numpy Github page (https://github.com/numpy/numpy/issues/2995) it has been brought to my attention that Numpy/Scipy will not support such a large number of non-zeros in the resulting sparse matrix.

Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense). However, note that since I have no zero rows in my W matrix, the resulting product W*(np.log(Q)-1) will have nnz > 2^31 (2.2 million multiplied by 2000) and this exceeds the maximum number of elements in a sparse matrix in current versions of Scipy.

At this stage, I'm not sure how else to get this to work, barring a re-implementation in another language. Perhaps it can still be done in Python, but it might be better to just write up a C++ and Eigen implementation.

A special thanks to pv. for helping out on this to pinpoint the exact issue, and thanks to everyone else for the brainstorming!

like image 71
Avneesh Avatar answered Sep 20 '22 16:09

Avneesh


Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense).

I'm probably missing something really obvious here, and will end up looking like an idiot, but…

If Q is a dense matrix, and you're hoping to store the result as a dense matrix, you probably have enough to hold W as a dense matrix as well. Which means:

W.todense()*(np.log(Q)-1)

Looking at the details, as you calculated in the comments, this would require 35.8GB of temporary memory. Given that you've got 131GB of data and this "fits comfortably into memory", it seems at least plausible that temporarily using another 35.8GB would be reasonable.

If it's not reasonable, you can always decompose the matrix multiplication yourself. Obviously doing it row by row or column by column would make your entire process much slower (maybe not as much as pushing the process over the edge into swapping, but still maybe far too slow to be acceptable). But doing it in, e.g., a chunk of 1GB worth of rows at a time shouldn't be too bad. That would mean temporary storage on the order of a few GB, and probably only a small slowdown. Of course it's more complicated and ugly code, but not unmanageably so.

like image 40
abarnert Avatar answered Sep 19 '22 16:09

abarnert