
How can I see the data preview of Dask DataFrame?

I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using

 df.head()

it takes too much time. How can I preview the dataframe?

Hari asked Oct 17 '22 15:10



1 Answer

It really depends on what computations are behind your dataframe.

The df.head() command executes only those operations necessary to get a few rows of data from the dataframe. Often this is very fast. For example, if we are reading a large dataframe from a Parquet or CSV file, then we only need to load the first chunk of data to get the first few rows.

import dask.dataframe as dd

df = dd.read_csv('...')
df.head()  # this is relatively fast: only the first chunk is read

However, if our dataframe is more complex, for example the result of a lazy shuffle or set_index operation, then we might genuinely need to read and process all of our data before we can get the first few rows.

df = df.set_index('some-column')
df = df.merge(some_other_df)
df.head()  # this is slow, because it has to do the set_index and merge

You can always see metadata cheaply (column names, types, number of tasks and partitions).

>>> df
Dask DataFrame Structure:
                       close     high      low     open
npartitions=505                                        
2008-01-02 09:00:00  float64  float64  float64  float64
2008-01-03 09:00:00      ...      ...      ...      ...
...                      ...      ...      ...      ...
2009-12-31 09:00:00      ...      ...      ...      ...
2009-12-31 16:00:00      ...      ...      ...      ...
Dask Name: from-delayed, 1010 tasks

Persist

If your data fits in RAM (or distributed RAM if you're on a cluster) then you should also persist it in memory. Later calls such as df.head() then reuse the computed results instead of recomputing them, which makes things very fast.

df = df.persist()

However, if you don't have enough RAM, this may slow down your machine.

MRocklin answered Oct 20 '22 11:10
