So let's say you have a Python process which collects data in real time at around 500 rows per second (this can be further parallelized to reduce it to around 50 per second) from a queueing system and appends it to a DataFrame:
rq = MyRedisQueue(..)
df = pd.DataFrame()
while True:
    recv = rq.get(block=True)
    # some converting
    df = df.append(recv, ignore_index=True)  # append returns a new DataFrame, so reassign
Now the question is: how to utilize the CPUs based on this data? I am fully aware of the limitations of the GIL, and looked into the multiprocessing Manager namespace here, too, but it looks like there are some drawbacks with regard to latency on the centrally held DataFrame. Before digging into it, I also tried pool.map, which I then recognized applies pickle between the processes, which is way too slow and has too much overhead.
So after all of this I finally wonder how (and if) an insert of 500 rows per second (or even 50 rows per second) can be transferred to different processes with some CPU time left for applying statistics and heuristics on the data in the child processes?
Maybe it would be better to implement a custom TCP socket or queueing system between the two processes? Or are there some implementations in pandas or other libraries that really allow fast access to the one big DataFrame in the parent process? I love pandas!
Before we start, I should say that you didn't tell us much about your code, but keep this point in mind: only transfer those 50/500 new rows each second to the child process and try to build that big DataFrame in the child process, as in the sketch below.
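A minimal sketch of that idea, assuming the converted rows can be represented as plain dicts; the worker function, the batch size of 50 and the statistics step are placeholders, not part of your code (MyRedisQueue is the placeholder from your question):

import multiprocessing as mp
import pandas as pd

def worker(q):
    # The child owns the big DataFrame and spends its CPU on the statistics.
    df = pd.DataFrame()
    batch = []
    while True:
        row = q.get()            # blocks until the parent forwards a row (a plain dict)
        if row is None:          # sentinel: parent is done
            break
        batch.append(row)
        if len(batch) >= 50:     # grow the DataFrame in batches, not row by row
            df = pd.concat([df, pd.DataFrame(batch)], ignore_index=True)
            batch.clear()
            # ... apply your statistics / heuristics on df here ...

if __name__ == "__main__":
    q = mp.Queue()
    mp.Process(target=worker, args=(q,)).start()

    rq = MyRedisQueue(..)        # placeholder from the question
    while True:
        recv = rq.get(block=True)
        # some converting
        q.put(recv)              # only the new row crosses the process boundary

This way the parent stays a thin forwarder and the child pays the cost of growing and analyzing the DataFrame.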
I'm working on a project very similar to yours. Python has many IPC implementations, such as Pipe and Queue, as you know. A Shared Memory solution may be problematic in many cases; AFAIK the official Python documentation warns about using shared memory.
In my experience the best way to transfer data between only two processes is Pipe, so you can pickle the DataFrame and send it to the other connection endpoint. I strongly suggest you avoid TCP sockets (AF_INET) in your case.
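A rough sketch of such a Pipe between the collector and an analysis child; the analyzer function and the batching are my own assumptions, but Connection.send() does pickle the payload for you and recv() unpickles it on the other end:

import multiprocessing as mp
import pandas as pd

def analyzer(conn):
    df = pd.DataFrame()
    while True:
        rows = conn.recv()                      # blocks; unpickles whatever was sent
        if rows is None:                        # sentinel to shut down
            break
        df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
        # ... statistics / heuristics on df ...
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    mp.Process(target=analyzer, args=(child_conn,)).start()

    # pretend these came from the Redis queue; in reality you would collect
    # the ~50/500 converted rows of one second and send them as one batch
    parent_conn.send([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
    parent_conn.send(None)                      # tell the child to stop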
A pandas DataFrame cannot be transferred to another process without getting pickled and unpickled, so I also recommend you transfer the raw data as built-in types like dict instead of a DataFrame. This might make pickling and unpickling faster, and it also has a smaller memory footprint.
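You can check that yourself with a quick comparison of the pickled payload sizes for a single row; the example row is made up and your numbers will differ:

import pickle
import pandas as pd

row = {"sensor": "s1", "value": 42.0, "ts": 1_700_000_000}

as_dict = pickle.dumps(row)
as_frame = pickle.dumps(pd.DataFrame([row]))

print(len(as_dict), len(as_frame))   # the one-row DataFrame payload is typically much larger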
Parallelisation in pandas
is probably better handled by another engine altogether.
Have a look at the Koalas project by Databricks or Dask's DataFrame.
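For example, a small Dask sketch; the toy DataFrame and the groupby are only an illustration, not your workload:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=4)                 # split into partitions for parallel work

result = ddf.groupby("key")["value"].mean().compute()    # computed in parallel, returns pandas
print(result)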