I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using
df.head()
it takes too much time. How can I view the dataframe?
It really depends on what computations are behind your dataframe.
The df.head() command executes only those operations necessary to get a few lines of data from the dataframe. Often this is very fast. For example, if we are reading a large dataframe from a Parquet or CSV file, then we only need to load in the first chunk of data to get the first few rows.
import dask.dataframe as dd

df = dd.read_csv('...')
df.head()  # this is relatively fast
However, if our dataframe is more complex, for example if it is the result of a lazy shuffle or set_index operation, then we might genuinely need to read and process all of our data before we can get the first few rows.
df = df.set_index('some-column')
df = df.merge(some_other_df)
df.head() # this is slow, because it has to do the set_index and merge
You can always see metadata cheaply (column names, types, number of tasks and partitions).
>>> df
Dask DataFrame Structure:
                       close     high      low     open
npartitions=505
2008-01-02 09:00:00  float64  float64  float64  float64
2008-01-03 09:00:00      ...      ...      ...      ...
...                      ...      ...      ...      ...
2009-12-31 09:00:00      ...      ...      ...      ...
2009-12-31 16:00:00      ...      ...      ...      ...
Dask Name: from-delayed, 1010 tasks
If your data fits in RAM (or distributed RAM if you're on a cluster) then you should also persist to memory. This will make things very fast.
df = df.persist()
However if you don't have enough RAM then this may slow down your machine.