
Update a DataFrame in different python processes realtime

So let's say you have a Python process which is collecting data in real time at around 500 rows per second (this can be further parallelized to reduce it to around 50 rows per second) from a queueing system and appending it to a DataFrame:

import pandas as pd

rq = MyRedisQueue(..)
df = pd.DataFrame()
while True:
    recv = rq.get(block=True)
    # some converting
    # append() returns a new DataFrame, so the result has to be reassigned
    df = df.append(recv, ignore_index=True)

Now the question is: how to utilize the CPUs based on this data? I am fully aware of the limitations of the GIL and looked into the multiprocessing Manager namespace, here too, but it looks like there are some drawbacks with regard to latency on the centrally held DataFrame. Before digging into it, I also tried pool.map, which I then realized pickles the data between the processes, which is way too slow and has too much overhead.

So after all of this I finally wonder how (and if) an insert of 500 rows per second (or even 50 rows per second) can be transferred to different processes with some CPU time left for applying statistics and heuristics on the data in the child processes.

Maybe it would be better to implement a custom TCP socket or queueing system between the two processes? Or are there implementations in pandas or other libraries that really allow fast access to the one big DataFrame in the parent process? I love pandas!

gies0r asked Mar 15 '20


2 Answers

Before we start I should say that you didn't tell us much about your code, but keep this point in mind: only transfer those 50/500 new rows each second to the child process, and try to build the big DataFrame in the child process.

I'm working on a project very similar to yours. As you know, Python has many IPC implementations, such as Pipe and Queue. A shared-memory solution may be problematic in many cases; AFAIK the official Python documentation warns about using shared memory.

In my experience the best way to transfer data between exactly two processes is a Pipe, so you can pickle the DataFrame and send it to the other connection endpoint. I strongly suggest you avoid TCP sockets (AF_INET) in your case.

A pandas DataFrame cannot be transferred to another process without getting pickled and unpickled, so I also recommend you transfer the raw data as built-in types like dict instead of a DataFrame. This should make pickling and unpickling faster, and it also has a smaller memory footprint. A minimal sketch of this approach is shown below.
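Here is a minimal sketch of that idea, assuming a plain multiprocessing.Pipe between the two processes; the worker function, the fake batch generator, and the field names are illustrative placeholders, and in your code the parent loop would read from MyRedisQueue instead:

import multiprocessing as mp
import pandas as pd

def worker(conn):
    # Child process: rebuild the DataFrame from the raw dict batches it receives
    df = pd.DataFrame()
    while True:
        batch = conn.recv()            # blocks until the parent sends something
        if batch is None:              # sentinel: parent is done
            break
        # batch is a list of dicts -> cheap to pickle, easy to concatenate
        df = pd.concat([df, pd.DataFrame(batch)], ignore_index=True)
        # ... apply statistics / heuristics on df here ...
    print(f"child collected {len(df)} rows")

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=worker, args=(child_conn,))
    proc.start()

    # In the real code this loop would read from the Redis queue instead
    for second in range(3):
        batch = [{"ts": second, "value": i} for i in range(500)]
        parent_conn.send(batch)        # one pickled list of dicts per tick

    parent_conn.send(None)             # tell the child to stop
    proc.join()

Sending one list of dicts per tick keeps the pickled payload small, while the child is free to maintain and analyze its own copy of the DataFrame.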

AmirHmZ answered Oct 06 '22

Parallelisation in pandas is probably better handled by another engine altogether.

Have a look at the Koalas project by Databricks or Dask's DataFrame.
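As a rough illustration of the Dask route (not part of the original answer; the column name and partition count are arbitrary), a pandas DataFrame can be split into partitions that Dask processes in parallel:

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"value": range(1_000_000)})

# Split the pandas DataFrame into partitions that Dask can work on in parallel
ddf = dd.from_pandas(pdf, npartitions=8)

# Computations are built lazily across partitions and materialized with .compute()
print(ddf["value"].mean().compute())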

jorijnsmit answered Oct 06 '22