python struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Q: What is struct package in Python?

The struct module converts between Python values and C structs represented as Python bytes objects. It is part of the standard library, so there is nothing to install. Its format strings mirror C data types, for example i for a 4-byte signed integer.

Q: What does struct unpack do Python?

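struct.unpack() converts a bytes object back into a tuple of Python values according to a format string. The related functions pack_into() and unpack_from() pack values into, and unpack values from, an existing buffer at a given offset. These two functions were introduced in Python 2.5.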
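As a minimal sketch of the buffer-based functions, the snippet below writes two values into a preallocated bytearray and reads them back. The format "!ih" and the values are illustrative assumptions, not taken from the question.

```python
import struct

# "!" selects network (big-endian) byte order with standard sizes and no padding,
# so "!ih" is a 4-byte signed int followed by a 2-byte signed short: 6 bytes.
FORMAT = "!ih"

buf = bytearray(8)                               # preallocated, writable buffer
struct.pack_into(FORMAT, buf, 0, 1000, 7)        # write both values at offset 0
value, small = struct.unpack_from(FORMAT, buf, 0)  # read them back from offset 0
```

Unlike struct.pack(), pack_into() does not allocate a new bytes object, which is why it needs a writable buffer and an explicit offset.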

Q: How do you pack a string in Python?

struct.pack() converts a sequence of values into their corresponding bytes representation. The caller specifies the format and order of the values in a format string.
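The error in the title can be reproduced without multiprocessing. The sketch below packs the largest value the "!i" format accepts, then tries one past it. The variable names are illustrative.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error message

# The largest value still fits in the 4-byte signed "i" format.
header = struct.pack("!i", INT32_MAX)
(decoded,) = struct.unpack("!i", header)

# One past the limit raises struct.error, the same exception as in the question.
try:
    struct.pack("!i", INT32_MAX + 1)
    overflow_message = None
except struct.error as exc:
    overflow_message = str(exc)
```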

Problem

I want to do feature engineering using the multiprocessing module (multiprocessing.Pool.starmap()). However, it fails with the error message below. I guess the error is about the size of the inputs (2147483647 = 2^31 − 1?), since the same code ran smoothly on a fraction (frac=0.05) of the input dataframes (train_scala, test, ts). I converted the dataframe column types to the smallest possible ones, but that did not help.

The Anaconda version is 4.3.30 and the Python version is 3.6 (64-bit). The system has over 128GB of memory and more than 20 cores. Could you suggest any pointer or solution to overcome this problem? If it is caused by the data being too large for the multiprocessing module, how much smaller does the data need to be to use multiprocessing on Python 3?

Code:

    from multiprocessing import Pool, cpu_count
    from itertools import repeat

    p = Pool(8)
    is_train_seq = [True]*len(historyCutoffs)+[False]
    config_zip = zip(historyCutoffs, repeat(train_scala), repeat(test), repeat(ts), ul_parts_path, repeat(members), is_train_seq)
    p.starmap(multiprocess_FE, config_zip)

Error Message:

    Traceback (most recent call last):
      File "main_1210_FE_scala_multiprocessing.py", line 705, in <module>
        print('----Pool starmap start----')
      File "/home/dmlab/ksedm1/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
        return self._map_async(func, iterable, starmapstar, chunksize).get()
      File "/home/dmlab/ksedm1/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 644, in get
        raise self._value
      File "/home/dmlab/ksedm1/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
        put(task)
      File "/home/dmlab/ksedm1/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 206, in send
        self._send_bytes(_ForkingPickler.dumps(obj))
      File "/home/dmlab/ksedm1/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
        header = struct.pack("!i", n)
    struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Extra infos

historyCutoffs is a list of integers
train_scala is a pandas DataFrame (377MB)
test is a pandas DataFrame (15MB)
ts is a pandas DataFrame (547MB)
ul_parts_path is a list of directories (string)
is_train_seq is a list of booleans

Extra Code: Method multiprocess_FE

    def multiprocess_FE(historyCutoff, train_scala, test, ts, ul_part_path, members, is_train):
        train_dict = {}
        ts_dict = {}
        msno_dict = {}
        ul_dict = {}
        if is_train == True:
            train_dict[historyCutoff] = train_scala[train_scala.historyCutoff == historyCutoff]
        else:
            train_dict[historyCutoff] = test
        msno_dict[historyCutoff] = set(train_dict[historyCutoff].msno)
        print('length of msno is {:d} in cutoff {:d}'.format(len(msno_dict[historyCutoff]), historyCutoff))
        ts_dict[historyCutoff] = ts[(ts.transaction_date <= historyCutoff) & (ts.msno.isin(msno_dict[historyCutoff]))]
        print('length of transaction is {:d} in cutoff {:d}'.format(len(ts_dict[historyCutoff]), historyCutoff))
        ul_part = pd.read_csv(gzip.open(ul_part_path, mode="rt"))  ##.sample(frac=0.01, replace=False)
        ul_dict[historyCutoff] = ul_part[ul_part.msno.isin(msno_dict[historyCutoff])]
        train_dict[historyCutoff] = enrich_by_features(historyCutoff, train_dict[historyCutoff], ts_dict[historyCutoff], ul_dict[historyCutoff], members, is_train)

408

asked Dec 12 '17 15:12

SUNDONG

2 Answers

The communication protocol between processes uses pickling, and the pickled data is prefixed with the size of the pickled data. For your method, all arguments together are pickled as one object.

You produced an object that, when pickled, is larger than what fits in an i struct format (a four-byte signed integer), which breaks the assumptions the code has made.
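One way to check whether a set of arguments is near that limit is to measure its pickled size before handing it to the pool. This is a rough sketch: the helper names and the sample tuple are hypothetical, and multiprocessing uses its own pickler and may batch several tasks into one message, so treat the number as an estimate only.

```python
import pickle

# Largest payload length the 4-byte signed "!i" header can describe (Python < 3.8).
MAX_PIPE_PAYLOAD = 2**31 - 1


def pickled_size(obj):
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def fits_in_pipe(obj):
    """Estimate whether obj can be sent in a single pre-3.8 pipe message."""
    return pickled_size(obj) <= MAX_PIPE_PAYLOAD


# Hypothetical stand-in for one starmap argument tuple.
task_args = (20170101, b"x" * 1_000_000, "part-0.csv.gz")
size = pickled_size(task_args)
```

With real dataframes in the tuple, size would show how close one task already is to the 2 GiB boundary.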

You could delegate reading of your dataframes to the child process instead, only sending across the metadata needed to load the dataframe. Their combined size is nearing 1GB, way too much data to share over a pipe between your processes.
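A minimal sketch of that approach, using only the standard library: each task receives a file path and a cutoff, and the worker opens the file itself. The file name, column names, and helper functions are assumptions made for illustration. In the question's code the worker would call pd.read_csv on the paths instead.

```python
import csv
import os
import tempfile
from multiprocessing import Pool


def count_rows_up_to(csv_path, cutoff):
    """Worker: open the file here, in the child process, then filter by cutoff."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return sum(1 for row in rows if int(row["transaction_date"]) <= cutoff)


def write_demo_csv(directory):
    """Create a tiny stand-in for the large transactions table."""
    path = os.path.join(directory, "ts_demo.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["msno", "transaction_date"])
        writer.writerows([["a", 20170101], ["b", 20170201], ["c", 20170301]])
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_demo_csv(tmp)
        cutoffs = [20170115, 20170215]
        with Pool(2) as pool:
            # Each task pickles only a short path string and an integer.
            counts = pool.starmap(count_rows_up_to, [(path, c) for c in cutoffs])
        print(counts)
```

The trade-off is that every worker reads the file again, so this suits data that loads quickly or that each worker only needs part of.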

Quoting from the Programming guidelines section:

Better to inherit than pickle/unpickle

When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.

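If you are not running on Windows and use the fork start method (the default on Linux for the Python 3.6 in the question), you could load your dataframes as globals before starting your subprocesses, at which point the child processes will 'inherit' the data via the normal OS copy-on-write memory page sharing mechanisms. This does not work with spawn, which starts a fresh interpreter that inherits none of the parent's globals.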
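A minimal sketch of inheriting data through a global, assuming a platform where the fork start method is available. BIG_TABLE, load_table, and count_up_to are hypothetical names, and a plain list stands in for the dataframes.

```python
import multiprocessing as mp

BIG_TABLE = None  # hypothetical stand-in for a large DataFrame


def load_table(values):
    """Populate the module-level global that forked workers will inherit."""
    global BIG_TABLE
    BIG_TABLE = list(values)


def count_up_to(cutoff):
    """Worker: read the inherited global instead of receiving it as an argument."""
    return sum(1 for value in BIG_TABLE if value <= cutoff)


if __name__ == "__main__":
    load_table(range(1_000_000))      # load BEFORE the pool is created
    ctx = mp.get_context("fork")      # assumption: fork exists (not on Windows)
    with ctx.Pool(2) as pool:
        # Only the small integer cutoffs are pickled and sent to the workers.
        print(pool.map(count_up_to, [10, 100]))
```

The order matters: anything assigned to the global after the pool has been created is not visible in workers that were already forked.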

Note that this limit was raised for non-Windows systems in Python 3.8, to an unsigned long long (8 bytes), so the length header can now describe messages of up to 2^64 − 1 bytes (16 EiB). See this commit, and Python issues #35152 and #17560.

If you can't upgrade and you can't make use of resource inheriting, and are not running on Windows, then use this patch:

    import functools
    import logging
    import struct
    import sys

    logger = logging.getLogger()


    def patch_mp_connection_bpo_17560():
        """Apply PR-10305 / bpo-17560 connection send/receive max size update

        See the original issue at https://bugs.python.org/issue17560 and
        https://github.com/python/cpython/pull/10305 for the pull request.

        This only supports Python versions 3.3 - 3.7, this function
        does nothing for Python versions outside of that range.

        """
        patchname = "Multiprocessing connection patch for bpo-17560"
        if not (3, 3) < sys.version_info < (3, 8):
            logger.info(
                patchname + " not applied, not an applicable Python version: %s",
                sys.version
            )
            return

        from multiprocessing.connection import Connection

        orig_send_bytes = Connection._send_bytes
        orig_recv_bytes = Connection._recv_bytes
        if (
            orig_send_bytes.__code__.co_filename == __file__
            and orig_recv_bytes.__code__.co_filename == __file__
        ):
            logger.info(patchname + " already applied, skipping")
            return

        @functools.wraps(orig_send_bytes)
        def send_bytes(self, buf):
            n = len(buf)
            if n > 0x7fffffff:
                pre_header = struct.pack("!i", -1)
                header = struct.pack("!Q", n)
                self._send(pre_header)
                self._send(header)
                self._send(buf)
            else:
                orig_send_bytes(self, buf)

        @functools.wraps(orig_recv_bytes)
        def recv_bytes(self, maxsize=None):
            buf = self._recv(4)
            size, = struct.unpack("!i", buf.getvalue())
            if size == -1:
                buf = self._recv(8)
                size, = struct.unpack("!Q", buf.getvalue())
            if maxsize is not None and size > maxsize:
                return None
            return self._recv(size)

        Connection._send_bytes = send_bytes
        Connection._recv_bytes = recv_bytes

        logger.info(patchname + " applied")

188

answered Sep 20 '22 10:09

Martijn Pieters

This problem was fixed in a PR to Python: https://github.com/python/cpython/pull/10305

If you want, you can make this change locally to make it work right away, without waiting for a Python and Anaconda release.

answered Sep 20 '22 10:09

Alex


Tags:

python

python-3.x

struct

multiprocessing

starmap