
Weird behavior in large complex128 NumPy arrays, imaginary part only [closed]

I'm working on numerical simulations and ran into an issue with large (~26 GB) complex128 NumPy arrays on a Linux machine with 128 GB of RAM.

  • Arrays are instantiated without errors (if allocation fails, it does so silently).
  • Assigning values to the arrays likewise completes without complaint.
  • Memory appears to be properly allocated, or at least the correct amount is reserved by the process (see the sketch after this list).

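A minimal sketch of one way to verify that last point on Linux (it reads VmRSS from /proc/self/status; the shape and the helper name are just illustrative):

import numpy as np


def rss_bytes():
    # Parse VmRSS out of /proc/self/status (the value is reported in kB).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return -1


before = rss_bytes()
arr = np.empty((2048, 1024, 1024), dtype=complex)  # complex128, 32 GiB
arr.fill(0)  # touch every page so the memory is actually committed
after = rss_bytes()

print(f"array nbytes: {arr.nbytes / 2**30:.1f} GiB")
print(f"RSS growth:   {(after - before) / 2**30:.1f} GiB")
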
I query the minimum and maximum values of the real and imaginary parts:

  • The real parts are always correct.
  • The imaginary parts are often incorrect.

Here is a test that reproduces the problem:

import numpy as np
import numba


# Single pass over a flattened array, tracking min/max of the real and imaginary parts.
@numba.jit(cache=True)
def minmaxrealimag(x):
    rmaximum = x[0].real
    rminimum = x[0].real
    imaximum = x[0].imag
    iminimum = x[0].imag

    for i in x[1:]:
        if i.real > rmaximum:
            rmaximum = i.real
        elif i.real < rminimum:
            rminimum = i.real
        if i.imag > imaximum:
            imaximum = i.imag
        elif i.imag < iminimum:
            iminimum = i.imag
    return (rminimum, rmaximum, iminimum, imaximum)


testn = 2048
testx = 1024


# dtype=complex is complex128; fill the real and imaginary parts with uniform random values.
field = np.empty((testx, testx), dtype=complex)
field.real[:] = np.random.rand(*field.shape)
field.imag[:] = np.random.rand(*field.shape)
profile = np.empty(testn, dtype=complex)
profile.real[:] = np.random.rand(*profile.shape)
profile.imag[:] = np.random.rand(*profile.shape)
correctslice = field[:, :] * profile[0]
# Initialise the running min/max from the first slice.
(rmin, rmax, imin, imax) = minmaxrealimag(correctslice.flatten())

# test1 is the large array (2048 x 1024 x 1024 complex128 = 32 GiB).
test1 = np.empty((testn, testx, testx), dtype=complex)
for itau in range(testn):
    correctslice = field[:, :] * profile[itau]
    (rmin2, rmax2, imin2, imax2) = minmaxrealimag(correctslice.flatten())
    if rmin2 < rmin:
        rmin = rmin2
    if rmax2 > rmax:  # 'if', not 'elif': a slice can update both the min and the max
        rmax = rmax2
    if imin2 < imin:
        imin = imin2
    if imax2 > imax:
        imax = imax2

    test1[itau] = correctslice[:, :]

print((rmin, rmax, imin, imax))         # min/max accumulated slice by slice
print(minmaxrealimag(test1.flatten()))  # min/max of the stored 32 GiB array

Simpler example (slower and less informative):

import numpy as np

testn = 2048
testx = 1024


field = np.empty((testn, testx, testx), dtype=complex)
field.real[:] = np.random.rand(*field.shape)
field.imag[:] = np.random.rand(*field.shape)

print(np.max(field.imag))  # should be < 1, since the values come from np.random.rand

Sometimes everything goes fine, but usually the minimum and maximum of the imaginary parts are incorrect. In these examples the minimum is plausible but wrong, and the maximum is either NaN (very rarely) or, more likely, close to the double-precision floating-point maximum, though a factor of two or four smaller; sometimes the exponent itself is a factor of two or four smaller than that of the float64 maximum. I've never seen it come out as Inf. From this I assume that this weird maximum value in the imaginary part has the following properties (see the bit-inspection sketch after the list):

  • The sign bit is always 0.
  • The mantissa is usually all 1s.
  • Bit 11 is almost always 0.
  • Bit 12 is almost always 1.

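A minimal sketch of how such a float64 can be inspected bit by bit (it reinterprets the value as a uint64; the example value is an illustrative stand-in, not one of my actual bad maxima):

import numpy as np


def float64_bits(x):
    # Reinterpret the float64 as a uint64 and format it as a 64-bit string:
    # bit 63 is the sign, bits 62-52 the exponent, bits 51-0 the mantissa.
    u = np.float64(x).view(np.uint64)
    return format(int(u), "064b")


bad = np.finfo(np.float64).max / 4  # stand-in for one of the bad maxima
bits = float64_bits(bad)
print("sign    :", bits[0])
print("exponent:", bits[1:12])
print("mantissa:", bits[12:])
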
When I try to locate the offending pixel and explicitly assign a value (say, 1) to it, the value remains unchanged.

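Roughly what I mean by that, as a sketch (it assumes the offending value shows up as the maximum of the imaginary part, and uses test1 from the first example):

import numpy as np

# Find the flat index of the largest imaginary part, convert it back to an
# (itau, y, x) index, overwrite that element, and read it back.
idx = np.unravel_index(np.argmax(test1.imag), test1.shape)
print("offending index:", idx, "value:", test1[idx])

test1[idx] = 1 + 0j  # explicit overwrite
print("after overwrite:", test1[idx])  # per the behaviour above, still the old value
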
The issue isn't always reproducible with these snippets, but it consistently stalls my simulations, which use multiple arrays of this size and frequently occupy >80% of system memory. The frustrating part is that there is no error, segfault, exception, or even a warning; otherwise I'd suspect a memory-safety issue. I have no way of knowing whether a simulation is going to work or go off the rails.

The obvious answer is to move to a distributed-memory model of parallelization, but first I want to know that the issue isn't on my end, aside from asking too much of my computer. I haven't tried this on a different computer because none of my other machines have as much memory. The computer in question behaves normally in every other way, however.

asked Nov 14 '25 by laserpropsims


1 Answer

So yeah, it does seem to be a memory issue. Thanks to @KellyBundy for the suggestion to run MemTest86.

There were so many errors that the test simply gave up when it hit 100,000, which is odd, because the system boots just fine and I've never had a problem with crashes (which is why I didn't immediately suspect a hardware problem). Even the simulations usually run fine until they reach a certain size. But the memory test showed a multitude of single-bit errors, always in the first two bytes.

I'm not very experienced with this kind of problem, but I tested each of the four modules in each of the four DIMM slots individually, and they all failed every time, so I think it's probably either a PSU problem or a bad memory controller on the CPU; until I can find a known-good PSU to swap in, I won't know which (I don't have access to a PSU tester). For reference, it's 128 GB of non-ECC UDIMMs, which in hindsight may have been a little ambitious. The CPU is a Ryzen 9 3900X.

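For anyone seeing something similar: before rebooting into a proper memory tester, a crude userspace check along these lines (a rough sketch, and no substitute for MemTest86) can already reveal gross bit flips:

import numpy as np

# Fill a large uint64 buffer with a fixed bit pattern, read it back and count
# mismatches; on healthy hardware this should report 0 corrupted words.
nbytes = 32 * 2**30                      # adjust to how much RAM you want to cover
pattern = np.uint64(0x5555555555555555)  # alternating 0/1 bits

buf = np.empty(nbytes // 8, dtype=np.uint64)
buf.fill(pattern)
bad = np.count_nonzero(buf != pattern)
print(f"{bad} corrupted words out of {buf.size}")

if bad:
    # Show which bit positions flipped in the first few corrupted words.
    flipped = buf[buf != pattern] ^ pattern
    print([format(int(v), "064b") for v in flipped[:5]])
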
answered Nov 17 '25 by laserpropsims


