I have a Dask DataFrame created from a CSV file, and len(daskdf)
returns 18000, but when I call ddSample = daskdf.sample(2000)
I get the error
ValueError: Cannot take a larger sample than population when 'replace=False'
Can I sample without replacement if the DataFrame is larger than the sample size?
Dask's sample method only supports the frac= keyword argument (the fraction of rows to return), not an absolute row count, so daskdf.sample(2000) is interpreted as a request for 2000 times the population. See the API documentation.
The error that you're getting is from Pandas, not Dask.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'
As the Pandas error suggests, consider sampling with replacement:
In [4]: df.sample(frac=2, replace=True)
Out[4]:
   x
0  1
0  1
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]:
   x
0  1
0  1
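To answer the original question: sampling without replacement works whenever the requested fraction is at most 1, so convert the desired row count into a fraction first. Below is a minimal sketch, where the generated DataFrame is a hypothetical stand-in for the 18,000-row CSV from the question:

import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-in for the 18,000-row DataFrame loaded from the CSV.
daskdf = dd.from_pandas(pd.DataFrame({'x': range(18000)}), npartitions=4)

# Convert the desired row count into a fraction of the population.
frac = 2000 / len(daskdf)   # len() triggers a computation over the data

# frac <= 1 here, so sampling without replacement is allowed.
ddSample = daskdf.sample(frac=frac, replace=False)

print(len(ddSample))        # roughly 2000 rows

Note that Dask applies the fraction to each partition independently, so the returned row count is approximate rather than exact.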