How do I find the length of a dataframe using dask?
For example in pandas, I can do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
print(df['A'].count())
print(df)
Output:
5
A B
0 1.538531 0.424717
1 -0.929843 1.323648
2 -1.283680 0.056199
3 -0.641035 -1.998241
4 -0.058598 -1.400637
In dask I try:
import dask.dataframe as dd
df_dask = dd.from_pandas(df, npartitions=3)
print(df_dask)
print(df_dask['A'].count())
Output:
A B
npartitions=2
0 float64 float64
2 ... ...
4 ... ...
Dask Name: from_pandas, 2 tasks
dd.Scalar<series-..., dtype=int32>
The real reason I need the length is that df_dask.sample() takes a fraction, and I want to sample a specified number of entries from the dataframe. I use the length to compute this fraction. Is there an easier/faster way of doing this?
Get the number of rows: len(df.index). In pandas, df.index returns the index object (for example, RangeIndex(start=0, stop=8, step=1)), and passing it to len() gives the row count.
Get the number of columns: len(df.columns). The number of columns of a pandas.DataFrame can be obtained by applying len() to the columns attribute.
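As a quick illustration of the two pandas calls just described, here is a minimal sketch. It reuses the 5x2 frame from the question, and the variable names n_rows and n_cols are only for illustration:

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the question: 5 rows, 2 columns.
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])

n_rows = len(df.index)    # number of rows
n_cols = len(df.columns)  # number of columns

print(n_rows, n_cols)
print(df.shape)  # (rows, columns) in one call
```

Note that len(df.index) counts every row, including rows that contain NaN, whereas df['A'].count() only counts non-NaN values.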
You can use len for the length of a dask DataFrame column or index:
print (len(df_dask['A']))
5
print (len(df_dask.index))
5
Your solution is better if you need to count only the non-NaN values; just add compute():
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
df.loc[0, 'A'] = np.nan
print (df)
A B
0 NaN -1.727669
1 -0.390900 0.573806
2 0.338589 -0.011830
3 2.392365 0.412912
4 0.978736 2.238143
import dask.dataframe as dd
df_dask = dd.from_pandas(df, npartitions=3)
print (df_dask['A'].count().compute())
4