
Dask: Update published dataset periodically and pull data from other clients

I would like to append data from a queue (like Redis) to a published Dask dataset. Other Python programs would then be able to fetch the latest data (e.g. once per second or minute) and do some further operations.

  1. Would that be possible?
  2. Which append interface should be used? Should I load the data into a pd.DataFrame first, or is it better to use some text importer?
  3. What are the assumed append speeds? Is it possible to append, let's say, 1k–10k rows per second?
  4. Are there other good suggestions for exchanging large, rapidly updating datasets within a Dask cluster?
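
A rough sketch of the workflow I have in mind (the scheduler address, the Redis list name "events", and the dataset name "live" are just placeholders):

```python
import json

import dask.dataframe as dd
import pandas as pd
import redis
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
r = redis.Redis()

# Producer: drain the queue, build a small batch, republish under a fixed name.
rows = []
while True:
    msg = r.lpop("events")  # placeholder Redis list
    if msg is None:
        break
    rows.append(json.loads(msg))

if rows:
    new = dd.from_pandas(pd.DataFrame(rows), npartitions=1)
    try:
        current = client.get_dataset("live")
        new = dd.concat([current, new])
        client.unpublish_dataset("live")  # published names can't be mutated in place
    except KeyError:
        pass  # nothing published yet, this is the first batch
    client.publish_dataset(live=new)

# Consumers (separate processes/clients): poll once per second or minute.
latest = client.get_dataset("live").compute()
```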

Thanks for any tips and advice.

asked by gies0r


1 Answer

You have a few options here.

  • You might take a look at the streamz project
  • You might take a look at Dask's coordination primitives (a minimal sketch follows this list)
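
For example, Dask's coordination primitives include a named distributed Queue that any client connected to the same scheduler can read from or write to. A minimal sketch (the scheduler address and queue name are made up):

```python
from dask.distributed import Client, Queue

client = Client("tcp://scheduler:8786")   # assumed scheduler address
updates = Queue("row-updates")            # named queue shared across clients

# Producer process: push small batches of rows.
updates.put([{"ts": 1, "value": 42.0}, {"ts": 2, "value": 43.0}])

# Consumer process (a separate client on the same scheduler):
batch = updates.get(timeout=5)            # waits up to 5 seconds for the next batch
```

Note that values put on a Queue travel through the scheduler, so this pattern suits small messages or batches; larger payloads are better scattered to workers and passed around as futures.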

What are the assumed append speeds? Is it possible to append, let's say, 1k–10k rows per second?

Dask is just tracking remote data. The speed of your application has a lot more to do with how you choose to represent that data (like Python lists vs. pandas DataFrames) than with Dask. Dask can handle thousands of tasks per second. Each of those tasks could hold a single row or millions of rows. It's up to how you build it.
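
For example (a hypothetical comparison), 10k rows arriving within one second can be shipped to the cluster as a single pandas DataFrame, i.e. one future for Dask to track, instead of 10k tiny per-row tasks:

```python
import pandas as pd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # assumed scheduler address

# 10,000 rows collected locally over roughly one second...
rows = [{"ts": i, "value": i * 0.1} for i in range(10_000)]

# ...then sent to the cluster as ONE object / one future,
# rather than 10,000 separate row-sized tasks.
batch_future = client.scatter(pd.DataFrame(rows))
```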

answered by MRocklin


