
Dask Dataframe: Get row count?

Simple question: I have a dataframe in dask containing about 300 million records. I need to know the exact number of rows that the dataframe contains. Is there an easy way to do this?

When I try to run dataframe.x.count().compute(), it looks like it tries to load the entire dataset into RAM. There is not enough memory for that, so it crashes.

usbToaster asked Mar 15 '18 21:03




2 Answers

import dask.dataframe

# keep the block size small enough that each partition fits in memory
ddf = dask.dataframe.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()

From the documentation:

blocksize: str, int or None, optional. Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.

CodeWarrior answered Sep 22 '22 03:09


If you only need the number of rows, you can load just a subset of the columns, picking the ones with lower memory usage (such as category or integer columns rather than string/object columns). After that you can run len(df.index).

skibee answered Sep 25 '22 03:09