Say I have a dataframe with 100,000 entries and want to split it into 100 sections of 1,000 entries each.
How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1,000 rows are the first section, the next 1,000 are the second section, and so on.
Many thanks.
The easiest way to randomly select rows from a Pandas dataframe is to use the sample() method. For example, if your dataframe is called "df", df.sample(n=200) returns 200 randomly selected rows. Note that omitting the n parameter returns a single random row instead of multiple rows.
You can use the sample method*:
    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6

    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2
*On one of the section DataFrames.
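Applied to the layout in the question, the footnote above might translate into something like the sketch below: slice out one section by position, then call sample on that slice. The DataFrame contents, the name block_size, and the choice of section i = 3 are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 100,000-row DataFrame from the question.
df = pd.DataFrame({"value": np.arange(100_000)})

block_size = 1000  # rows per section, as described in the question
i = 3              # zero-based section number (arbitrary choice for this sketch)

# Slice out section i by position, then draw 50 distinct rows from it.
section = df.iloc[i * block_size:(i + 1) * block_size]

# random_state is optional; it only makes the draw repeatable.
sample = section.sample(n=50, random_state=0)
```

Because sample keeps the original index labels, the rows in the result can still be traced back to their positions in the full DataFrame.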
Note: If you ask for a sample larger than the DataFrame itself, this will raise an error unless you sample with replacement.
    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'

    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4
One solution is to use the choice function from numpy.
Say you want 50 entries out of the first 1000 rows, you can use:
    import numpy as np

    # Pick 50 distinct positions from 0..999, then select those rows by position.
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50 item sample from block i, for example, you can do:
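    import numpy as np

    # i is the zero-based block number, so block i starts at row 1000 * i.
    block_start_idx = 1000 * i
    # Pick 50 distinct offsets within the block, then shift them to the block's rows.
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]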
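If you need this for several blocks, the same idea could be wrapped in a small helper. The following is only a sketch: the function name sample_from_block, its parameters, and the example DataFrame are hypothetical, and it swaps np.random.choice for NumPy's Generator API (np.random.default_rng) so that a seed can be passed for repeatable draws.

```python
import numpy as np
import pandas as pd

def sample_from_block(df, block, block_size=1000, n=50, seed=None):
    """Return n distinct rows drawn at random from one fixed-size block of df.

    `block` is zero-based: block 0 covers rows 0..block_size-1, and so on.
    """
    rng = np.random.default_rng(seed)
    start = block * block_size
    # Clamp the end so a shorter final block is handled too.
    stop = min(start + block_size, len(df))
    if start >= stop:
        raise IndexError(f"block {block} is outside the DataFrame")
    # Draw n distinct offsets inside the block and shift them to absolute positions.
    offsets = rng.choice(stop - start, size=n, replace=False)
    return df.iloc[start + offsets]

# Hypothetical stand-in for the 100,000-row DataFrame from the question.
df = pd.DataFrame({"value": np.arange(100_000)})
out = sample_from_block(df, block=7, seed=1)
```

As with the plain version, asking for more rows than the block contains raises a ValueError, since the offsets are drawn without replacement.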