How do I find the length of a dataframe in dask?

Tags: python, pandas, dask

How do I find the length of a dataframe using dask?

For example in pandas, I can do:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
print(df['A'].count())
print(df)

Output:

5
          A         B
0  1.538531  0.424717
1 -0.929843  1.323648
2 -1.283680  0.056199
3 -0.641035 -1.998241
4 -0.058598 -1.400637

In dask I try:

import dask.dataframe as dd
df_dask = dd.from_pandas(df, npartitions=3)
print(df_dask)
print(df_dask['A'].count())

Output:

                     A        B
npartitions=2                  
0              float64  float64
2                  ...      ...
4                  ...      ...
Dask Name: from_pandas, 2 tasks

dd.Scalar<series-..., dtype=int32>

The real reason I need the length is that df_dask.sample() takes a fraction, and I want to sample a specified number of rows from the dataframe. I use the length to compute this fraction. Is there an easier or faster way of doing this?

asked May 28 '18 15:05

C. L.

1 Answer

You can use len to get the length of a dask DataFrame column or of its index:

print (len(df_dask['A']))
5

print (len(df_dask.index))
5

Your approach with count() is better if you need to count only the non-NaN values. Add compute() to get the actual number:

df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
df.loc[0, 'A'] = np.nan
print (df)
          A         B
0       NaN -1.727669
1 -0.390900  0.573806
2  0.338589 -0.011830
3  2.392365  0.412912
4  0.978736  2.238143

import dask.dataframe as dd
df_dask = dd.from_pandas(df, npartitions=3)

print (df_dask['A'].count().compute())
4
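
For the sampling part of the question, here is a minimal sketch of turning a target row count into the fraction that sample() expects. The helper names sample_fraction and sample_n_rows are hypothetical (they are not part of dask), and the sketch assumes that len() on the dask DataFrame is affordable, since it triggers a computation:

```python
def sample_fraction(n_rows, total_rows):
    """Return the fraction to pass to sample(frac=...) for about n_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    # Clamp to 1.0 so that asking for more rows than exist returns everything.
    return min(1.0, n_rows / total_rows)


def sample_n_rows(ddf, n_rows, random_state=None):
    """Sample roughly n_rows rows from a dask DataFrame (hypothetical helper)."""
    total_rows = len(ddf)  # triggers a computation over the partitions
    frac = sample_fraction(n_rows, total_rows)
    # sample() is lazy; call .compute() on the result to get a pandas object.
    return ddf.sample(frac=frac, random_state=random_state)
```

With the frame from the question this would be called as sample_n_rows(df_dask, 2).compute(). As far as I know, dask applies the fraction to each partition separately, so the number of rows returned may differ slightly from the requested count.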

answered Oct 10 '22 10:10

jezrael
