I have a main folder that contains sub-folders, which in turn contain the images for the dataset, as follows:
-main_db
---CLASS_1
-----img_1
-----img_2
-----img_3
-----img_4
---CLASS_2
-----img_1
-----img_2
-----img_3
-----img_4
---CLASS_3
-----img_1
-----img_2
-----img_3
-----img_4
I need to split this dataset into two parts, i.e. train data (70%) and test data (30%). Below is the hierarchy I want to achieve:
-main_db
---training_data
-----CLASS_1
-------img_1
-------img_2
-------img_3
-------img_4
-----CLASS_2
-------img_1
-------img_2
-------img_3
-------img_4
---testing_data
-----CLASS_1
-------img_5
-------img_6
-------img_7
-------img_8
-----CLASS_2
-------img_5
-------img_6
-------img_7
-------img_8
Any help appreciated. Thanks
I have tried the module below, but it is not working for me; it is not being imported at all.
https://github.com/jfilter/split-folders
This is exactly what I want.
Steps for creating the directory structure for the images: create folders class_A and class_B as subfolders inside the train and validation folders. Place 80% of the class_A images in data/train/class_A and the remaining 20% in data/validation/class_A. Likewise, place 80% of the class_B images in data/train/class_B and the remaining 20% in data/validation/class_B. A scripted version of these steps is sketched below.
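If you would rather script those steps than move files by hand, here is a minimal sketch. The data/raw source path, the 80/20 fraction, and the choice of copying rather than moving are assumptions you would adapt to your own layout.

import random
import shutil
from pathlib import Path

random.seed(42)  # reproducible split

source_dir = Path("data/raw")   # assumed source layout: data/raw/class_A, data/raw/class_B
train_dir = Path("data/train")
val_dir = Path("data/validation")
train_fraction = 0.8

for class_dir in source_dir.iterdir():
    if not class_dir.is_dir():
        continue
    images = sorted(class_dir.iterdir())
    random.shuffle(images)
    split_at = int(len(images) * train_fraction)

    # first 80% goes to train, the rest to validation, keeping the class subfolder
    for subset_dir, subset in ((train_dir, images[:split_at]),
                               (val_dir, images[split_at:])):
        target = subset_dir / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for img in subset:
            shutil.copy2(img, target / img.name)  # copies; use shutil.move to relocate instead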
If you are not too keen on coding, there is a Python package called split-folders that you could use. It is extremely easy to use and can be found at https://github.com/jfilter/split-folders. Here is how it can be used.
pip install split-folders
import split_folders  # newer versions of the package are imported as splitfolders
input_folder = "/path/to/input/folder"
output = "/path/to/output/folder"  # where you want the split datasets saved; it will be created if it does not exist
split_folders.ratio(input_folder, output=output, seed=42, ratio=(.8, .1, .1))  # ratios are in the order train/val/test; for train/val sets only, you could use e.g. (.75, .25)
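For the 70/30 train/test split asked about in the question, a two-element ratio should work as well; the output path below is an assumption, and note that the package names its output subfolders train and val, so you may want to rename them to training_data and testing_data afterwards.

import splitfolders  # or import split_folders in older versions

# splits main_db/CLASS_*/img_* into main_db_split/train/CLASS_* and main_db_split/val/CLASS_*
splitfolders.ratio("main_db", output="main_db_split", seed=42, ratio=(.7, .3))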
However, I strongly recommend the coding answers presented here, because working through the code helps you learn.
This should do it. It calculates how many images are in each class folder and then splits them accordingly, saving the test data in a separate folder with the same structure.
Save the code in a main.py file and run:
python3 main.py --data_path=/path1 --test_data_path_to_save=/path2 --train_ratio=0.7
import shutil
import os
import numpy as np
import argparse


def get_files_from_folder(path):
    files = os.listdir(path)
    return np.asarray(files)


def main(path_to_data, path_to_test_data, train_ratio):
    # get class dirs
    _, dirs, _ = next(os.walk(path_to_data))

    # calculates how many images there are per class
    data_counter_per_class = np.zeros((len(dirs)))
    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))

    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])

        # creates dir
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)
        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)


def parse_args():
    parser = argparse.ArgumentParser(description="Dataset divider")
    parser.add_argument("--data_path", required=True,
                        help="Path to data")
    parser.add_argument("--test_data_path_to_save", required=True,
                        help="Path to test data where to save")
    parser.add_argument("--train_ratio", required=True,
                        help="Train ratio - 0.7 means splitting data into 70% train and 30% test")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args.data_path, args.test_data_path_to_save, float(args.train_ratio))
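To sanity-check the result, a quick count of the files per class in both trees should confirm the roughly 70/30 split; the two paths below are the same hypothetical ones used in the example command above.

import os

for root in ("/path1", "/path2"):  # remaining (train) tree and the new test tree
    print(root)
    _, class_dirs, _ = next(os.walk(root))
    for class_dir in class_dirs:
        n_images = len(os.listdir(os.path.join(root, class_dir)))
        print(f"  {class_dir}: {n_images} images")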