 

What can cause IOError: bad message length

I am using python scripts to manipulate and extract information from 4D images (functional MRI scans). Part of the analysis is setup to run in parallel (for every subject) using the multiprocessing package:

from multiprocessing import Pool

pool = Pool(processes=numberCores)
resultList = pool.map(SubjectProcesser, argList)  # argList is the list of arguments passed to each process

These are applied to different kinds of files and different types of analysis. For one specific type of analysis, I get the following error:

Process PoolWorker-1:
Traceback (most recent call last):
File "/home2/user/epd/epd-7.2-2-rh5-x86_64/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  self.run()
File "/home2/user/epd/epd-7.2-2-rh5-x86_64/lib/python2.7/multiprocessing/process.py", line 114, in run
  self._target(*self._args, **self._kwargs)
File "/home2/surchs/epd/epd-7.2-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 99, in worker
  put((job, i, result))
File "/home2/user/epd/epd-7.2-2-rh5-x86_64/lib/python2.7/multiprocessing/queues.py", line 392, in put
  return send(obj)
IOError: bad message length

I have narrowed it down to the point where it fails. The parallel processes apparently execute OK (determined by looking at my various debug printouts during different stages of my scripts) but then the failure happens during the remapping of the results.

I have searched for this error message but haven't found any solutions yet. Since my scripts do work for all other types of analysis I am wondering what might be going on.

A bit about the analysis, since I guess this plays into the problem:

The different analyses are more or less timeseries extractions of voxels in the brain (imagine the brain as a 3D matrix with time as the fourth dimension; the matrix elements are called voxels). Any point in the brain has an activation value for every point in time, so the timeseries of a given voxel is the vector of its activation values over time.

I then calculate the correlation coefficient between all the voxels (giving me a square correlation matrix with dimensions voxels by voxels) and return a vector of all the correlation coefficients (the lower triangle of the matrix) as the output of the parallel processing.
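The correlation step described above can be sketched with NumPy as follows (the sizes are made-up placeholders, not the real voxel counts):

```python
import numpy as np

# Hypothetical sizes: 100 voxels, 50 time points
n_voxels, n_timepoints = 100, 50
rng = np.random.default_rng(0)
timeseries = rng.standard_normal((n_voxels, n_timepoints))

# Square voxel-by-voxel correlation matrix
corr = np.corrcoef(timeseries)             # shape (n_voxels, n_voxels)

# Lower triangle (below the diagonal) flattened to a vector,
# which is what each parallel process would return
tril_idx = np.tril_indices(n_voxels, k=-1)
corr_vector = corr[tril_idx]               # length n_voxels * (n_voxels - 1) / 2
```

With whole-brain voxel counts this vector grows quadratically, which is why the full-brain analysis produces a vastly larger result than the region-averaged one.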

Now for all the analyses that don't throw the error, I am averaging multiple voxels (based on regional nodes) and then using the average timeseries for each region - effectively doing two things:

  1. drastically reducing the number of voxels (to the number of regions)
  2. getting rid of voxels which are always zero (as a result of the averaging; no region will contain only zero voxels)
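The region averaging can be sketched like this (region labels and sizes are illustrative assumptions, not taken from the actual pipeline):

```python
import numpy as np

# Hypothetical sizes: 1000 voxels, 50 time points, 10 regions
n_voxels, n_timepoints, n_regions = 1000, 50, 10
rng = np.random.default_rng(1)
timeseries = rng.standard_normal((n_voxels, n_timepoints))

# Assign each voxel to one region (labels 1..n_regions)
labels = np.repeat(np.arange(1, n_regions + 1), n_voxels // n_regions)

# One mean timeseries per region: far fewer rows than voxels,
# and no all-zero rows since each region averages many voxels
region_ts = np.vstack([timeseries[labels == r].mean(axis=0)
                       for r in range(1, n_regions + 1)])
```

The resulting matrix has one row per region instead of one per voxel, so the downstream correlation vector shrinks quadratically.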

In contrast, the analysis that gives the above-mentioned error uses the timeseries of all voxels in the brain, resulting in a much larger correlation matrix.

I tried to get rid of zero-voxels by masking every subject's file, and I am not getting any 'division by zero' errors, but these are the only two things I can think of.

Also, as said above, the parallel part of the processing runs through without problems. The error gets thrown after it has run, possibly during the remapping of the results.

Any help would be greatly appreciated. Also, if I should provide additional details, please let me know.

asked Nov 04 '22 by surchs

1 Answer

I have been experiencing the same issue when the objects I return from my child processes grow too large (in my case, tens of gigabytes). These huge objects need to be pickled and sent back to the parent process through inter-process communication, and that might be the cause of the issue. Of course, even without this error, moving tens of gigabytes of data around would be a bad idea. So my solution was to change the structure of my program to eliminate the need for passing around such large objects.

One thing you might be able to do is use shared memory. I have not had much luck with this because my objects are very complex and not easy to create in shared memory without a lot of code changes, but yours might be easier to manage.

See also this other thread: Shared-memory objects in python multiprocessing.

answered Nov 15 '22 by stacksia