Below I have code where I read a CSV file and take a random sample of 700 rows from it.
I need to do this on multiple files, but if I iterate over the files, the sample (as it is random) will be different for each file, whereas I want to keep the same rows once they have been randomly chosen.
import pandas as pd
df = pd.read_csv("file.csv", delim_whitespace=True)
df_s = df.sample(n=700)
My idea is to take the sampled row numbers and pass them to the next file, but this does not seem very elegant.
Do you know any good solutions to this issue?
CLARIFICATION
The file lengths are different, but there is a minimum file length: 750.
DESIRED OUTCOME EXAMPLE
df1 = pd.read_csv("file1.csv", delim_whitespace=True)
df_s1 = df1.sample(n=700) # choose random sample
df2 = pd.read_csv("file2.csv", delim_whitespace=True)
df_s2 = df2.sample(n=700) # use same random sample as above
I think you can use the random_state parameter of sample, but it only picks the same rows if all files have the same number of rows, so also pass the nrows parameter to read_csv:
df = pd.read_csv("file.csv", delim_whitespace=True, nrows=750)
df_s = df.sample(n=700, random_state=123)
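For comparison, here is a minimal sketch of the row-number idea from the question: draw the positions once and reuse them with iloc. It is only an illustration. The in-memory frames stand in for the files, and the names MIN_ROWS, N_SAMPLE and positions are hypothetical. The last lines check that the random_state approach selects the same positions when the frames have equal length. Note that both approaches only ever sample from the first 750 rows of each file.

```python
import numpy as np
import pandas as pd

MIN_ROWS = 750   # minimum file length stated in the question
N_SAMPLE = 700   # sample size from the question

# Stand-ins for two files of different lengths (hypothetical data).
df1 = pd.DataFrame({"value": np.arange(800)})
df2 = pd.DataFrame({"value": np.arange(1000, 1900)})

# Draw the row positions once, from the range every file is guaranteed to have.
rng = np.random.default_rng(123)
positions = rng.choice(MIN_ROWS, size=N_SAMPLE, replace=False)

# Reuse the same positions for every file.
df_s1 = df1.iloc[positions]
df_s2 = df2.iloc[positions]

# Check of the random_state approach: frames cut to the same length and
# sampled with the same seed yield the same row positions.
a = df1.iloc[:MIN_ROWS].sample(n=N_SAMPLE, random_state=123)
b = df2.iloc[:MIN_ROWS].sample(n=N_SAMPLE, random_state=123)
same_rows = bool((a.index == b.index).all())
```

Drawing the positions explicitly makes the shared selection visible in the code, while random_state with nrows is shorter; which one reads better is a matter of taste.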