I have a Dask DataFrame created from a CSV file, and len(daskdf)
returns 18000, but when I call ddSample = daskdf.sample(2000)
I get the error
ValueError: Cannot take a larger sample than population when 'replace=False'
Can I sample without replacement if the DataFrame is larger than the sample size?
Dask's sample method only supports the frac= keyword argument (the fraction of rows to return), not an absolute row count, so daskdf.sample(2000) is interpreted as a request for 2000 times the population. See the API documentation.
The error that you're getting is from Pandas, not Dask.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'
As the Pandas error suggests, consider sampling with replacement:
In [4]: df.sample(frac=2, replace=True)
Out[4]:
   x
0  1
0  1
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]:
   x
0  1
0  1
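To answer the original question: sampling without replacement works whenever the requested fraction is at most 1, so convert the desired row count into a fraction first. Below is a minimal sketch, where the generated DataFrame is a hypothetical stand-in for the 18,000-row CSV from the question:

import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-in for the 18,000-row DataFrame loaded from the CSV.
daskdf = dd.from_pandas(pd.DataFrame({'x': range(18000)}), npartitions=4)

# Convert the desired row count into a fraction of the population.
frac = 2000 / len(daskdf)   # len() triggers a computation over the data

# frac <= 1 here, so sampling without replacement is allowed.
ddSample = daskdf.sample(frac=frac, replace=False)

print(len(ddSample))        # roughly 2000 rows

Note that Dask applies the fraction to each partition independently, so the returned row count is approximate rather than exact.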