So let's say you have a Python process which collects data in real time at around 500 rows per second (this can be further parallelized to reduce it to around 50 per second) from a queueing system and appends it to a DataFrame:
rq = MyRedisQueue(..)
df = pd.DataFrame()
while True:
    recv = rq.get(block=True)
    # some converting
    df = df.append(recv, ignore_index=True)  # append returns a new DataFrame, so reassign
Now the question is: how to utilize the CPUs based on this data? I am fully aware of the limitations of the GIL, and looked into the multiprocessing Manager namespace here, too, but it looks like there are some drawbacks with regard to latency on the centrally held DataFrame. Before digging into it, I also tried pool.map, which I then recognized applies pickle between the processes, which is way too slow and has too much overhead.
So after all of this I finally wonder how (and if) an insert of 500 rows per second (or even 50 rows per second) can be transferred to different processes with some CPU time left for applying statistics and heuristics on the data in the child processes?
Maybe it would be better to implement a custom TCP socket or queueing system between the two processes? Or are there some implementations in pandas or other libraries that really allow fast access to the one big DataFrame in the parent process? I love pandas!
Before we start, I should say that you didn't tell us much about your code, but keep this point in mind: only transfer those 50/500 new rows each second to the child process and try to build that big DataFrame in the child process, as in the sketch below.
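A minimal sketch of that idea, assuming the converted rows can be represented as plain dicts; the worker function, the batch size of 50 and the statistics step are placeholders, not part of your code (MyRedisQueue is the placeholder from your question):

import multiprocessing as mp
import pandas as pd

def worker(q):
    # The child owns the big DataFrame and spends its CPU on the statistics.
    df = pd.DataFrame()
    batch = []
    while True:
        row = q.get()            # blocks until the parent forwards a row (a plain dict)
        if row is None:          # sentinel: parent is done
            break
        batch.append(row)
        if len(batch) >= 50:     # grow the DataFrame in batches, not row by row
            df = pd.concat([df, pd.DataFrame(batch)], ignore_index=True)
            batch.clear()
            # ... apply your statistics / heuristics on df here ...

if __name__ == "__main__":
    q = mp.Queue()
    mp.Process(target=worker, args=(q,)).start()

    rq = MyRedisQueue(..)        # placeholder from the question
    while True:
        recv = rq.get(block=True)
        # some converting
        q.put(recv)              # only the new row crosses the process boundary

This way the parent stays a thin forwarder and the child pays the cost of growing and analyzing the DataFrame.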
I'm working on a project very similar to yours. Python has many IPC implementations, such as Pipe and Queue, as you know. A Shared Memory solution may be problematic in many cases; AFAIK the official Python documentation warns about using shared memory.
In my experience the best way to transfer data between only two processes is Pipe, so you can pickle the DataFrame and send it to the other connection endpoint. I strongly suggest you avoid TCP sockets (AF_INET) in your case.
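A rough sketch of such a Pipe between the collector and an analysis child; the analyzer function and the batching are my own assumptions, but Connection.send() does pickle the payload for you and recv() unpickles it on the other end:

import multiprocessing as mp
import pandas as pd

def analyzer(conn):
    df = pd.DataFrame()
    while True:
        rows = conn.recv()                      # blocks; unpickles whatever was sent
        if rows is None:                        # sentinel to shut down
            break
        df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
        # ... statistics / heuristics on df ...
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    mp.Process(target=analyzer, args=(child_conn,)).start()

    # pretend these came from the Redis queue; in reality you would collect
    # the ~50/500 converted rows of one second and send them as one batch
    parent_conn.send([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
    parent_conn.send(None)                      # tell the child to stop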
A pandas DataFrame cannot be transferred to another process without getting pickled and unpickled, so I also recommend you transfer the raw data as built-in types like dict instead of a DataFrame. This might make pickling and unpickling faster, and it also has a smaller memory footprint.
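You can check that yourself with a quick comparison of the pickled payload sizes for a single row; the example row is made up and your numbers will differ:

import pickle
import pandas as pd

row = {"sensor": "s1", "value": 42.0, "ts": 1_700_000_000}

as_dict = pickle.dumps(row)
as_frame = pickle.dumps(pd.DataFrame([row]))

print(len(as_dict), len(as_frame))   # the one-row DataFrame payload is typically much larger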
Parallelisation in pandas
is probably better handled by another engine altogether.
Have a look at the Koalas project by Databricks or Dask's DataFrame.
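For example, a small Dask sketch; the toy DataFrame and the groupby are only an illustration, not your workload:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=4)                 # split into partitions for parallel work

result = ddf.groupby("key")["value"].mean().compute()    # computed in parallel, returns pandas
print(result)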