I can split my dataset into train and test splits with an 80%:20% ratio using:
from datasets import load_dataset
ds = load_dataset("myusername/mycorpus")
ds = ds["train"].train_test_split(test_size=0.2)  # my dataset on the HF Hub has only a train split
print(ds)
which outputs:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 15512
    })
})
How can I also generate a validation split, for an 80%:10%:10% ratio?
from datasets import DatasetDict, load_dataset
ds = load_dataset("myusername/mycorpus")
# Hold out 20% of the rows for test + validation
train_testvalid = ds['train'].train_test_split(test_size=0.2)
# Split the held-out 20% in half: 10% test, 10% validation
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather the three splits into a single DatasetDict
ds = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
That will output a dataset with the following structure:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
})
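To see why `test_size=0.2` followed by `test_size=0.5` gives 80%:10%:10%, the row counts can be checked by hand. The sketch below is plain Python and does not call the library; `split_sizes` is a hypothetical helper, and it assumes the held-out portion is rounded up, which is consistent with the counts shown above.

```python
import math

def split_sizes(n_rows, holdout=0.2, test_share=0.5):
    """Return (train, valid, test) row counts for a two-step split."""
    # Step 1: hold out a fraction of all rows (assumed to round up).
    n_holdout = math.ceil(holdout * n_rows)
    n_train = n_rows - n_holdout
    # Step 2: divide the held-out rows between test and validation.
    n_test = math.ceil(test_share * n_holdout)
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test

# 77556 = 62044 + 15512, the total number of rows in the question.
print(split_sizes(77556))
```

Half of the 20% hold-out is 10% of the original data, so the second split uses 0.5 rather than 0.1.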
Hope that helps.