SHA Hashing for training/validation/testing set split

Tags: python, machine-learning, tensorflow, sha

Below is a small snippet from the full code.

I am trying to understand the logic behind this method of splitting.

A SHA-1 digest is 40 hexadecimal characters. What kind of probability is being computed in the expression?
What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
Does setting MAX_NUM_IMAGES_PER_CLASS to different values affect the quality of the split?

How good a split would we get out of this? Is this a recommended way of splitting datasets?

  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put an image in, the data set creator has a way of
  # grouping photos that are close variations of each other. For example
  # this is used in the plant disease data set to group multiple pictures of
  # the same leaf.
  hash_name = re.sub(r'_nohash_.*$', '', file_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_IMAGES_PER_CLASS))
  if percentage_hash < validation_percentage:
    validation_images.append(base_name)
  elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
  else:
    training_images.append(base_name)

  result[label_name] = {
      'dir': dir_name,
      'training': training_images,
      'testing': testing_images,
      'validation': validation_images,
      }

asked Jan 31 '17 10:01

Ujjwal

1 Answer

This code simply distributes file names “randomly” (but reproducibly) over a number of bins and then groups the bins into the three categories. The number of bits in the hash is irrelevant, so long as it is “enough”, which is probably about 35 for this sort of work.

Reducing modulo n+1 produces a value in [0, n], and multiplying that by 100/n produces a value in [0, 100], which is interpreted as a percentage. Adding 1 to the modulus is what lets the remainder reach n, and therefore lets the percentage reach 100. Using MAX_NUM_IMAGES_PER_CLASS as n is meant to keep the rounding error in that interpretation to no more than “one image”.
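To make the arithmetic concrete, here is a minimal standalone sketch of the same computation. The value of MAX_NUM_IMAGES_PER_CLASS is an assumption (the snippet does not show its definition), the helper names percentage_hash and assign are hypothetical, and str.encode stands in for compat.as_bytes so the sketch runs without TensorFlow.

```python
import hashlib
import re

# Assumed value: the quoted snippet does not show how this constant is defined.
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def percentage_hash(file_name, n=MAX_NUM_IMAGES_PER_CLASS):
    """Map a file name to a stable value in [0, 100] (hypothetical helper)."""
    # Drop everything from '_nohash_' on, so grouped files hash identically.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    digest = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % (n + 1)  # integer in [0, n]
    return bucket * (100.0 / n)         # float in [0, 100], up to rounding


def assign(file_name, validation_percentage, testing_percentage):
    """Pick a set using the same comparisons as the quoted snippet."""
    p = percentage_hash(file_name)
    if p < validation_percentage:
        return 'validation'
    if p < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'


if __name__ == '__main__':
    for name in ['leaf1_nohash_0.jpg', 'leaf1_nohash_1.jpg', 'leaf2.jpg']:
        print(name, assign(name, 10, 10))
```

Because the result depends only on the file name, the two leaf1 files always land in the same set, and adding more files later never moves an existing file.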

This strategy is reasonable, but it looks a bit more sophisticated than it is: there is still rounding going on, and the remainder introduces a bias, although with numbers this large it is utterly unobservable. You could make it simpler and more accurate by precalculating, over the whole space of 2^160 hashes, the range that belongs to each of the three sets, and then checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits the only rounding left is that intrinsic to representing decimals like 31% in floating point.
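The boundary idea could be sketched as follows. This is an illustration under stated assumptions, not code from the original script: the names HASH_SPACE, make_boundaries and assign_by_range are hypothetical, and the sketch uses fractions.Fraction for the percentages so that even the floating-point rounding mentioned above is sidestepped.

```python
import hashlib
import math
from fractions import Fraction

HASH_SPACE = 2 ** 160  # number of distinct SHA-1 digests


def make_boundaries(validation_percentage, testing_percentage):
    """Return the two integer cut points in [0, 2**160] (hypothetical helper)."""
    # Going through str() keeps a literal such as 12.5 exact as a Fraction.
    v = Fraction(str(validation_percentage)) / 100
    t = Fraction(str(testing_percentage)) / 100
    validation_end = math.floor(HASH_SPACE * v)
    testing_end = math.floor(HASH_SPACE * (v + t))
    return validation_end, testing_end


def assign_by_range(name, boundaries):
    """Compare the full 160-bit hash against the precomputed cut points."""
    validation_end, testing_end = boundaries
    h = int.from_bytes(hashlib.sha1(name.encode('utf-8')).digest(), 'big')
    if h < validation_end:
        return 'validation'
    if h < testing_end:
        return 'testing'
    return 'training'


if __name__ == '__main__':
    cuts = make_boundaries(10, 10)
    for name in ['leaf1', 'leaf2', 'leaf3']:
        print(name, assign_by_range(name, cuts))
```

The boundaries are computed once per run, so each file costs one hash and at most two integer comparisons, with no modulus and no constant like MAX_NUM_IMAGES_PER_CLASS to choose.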

answered Sep 18 '22 06:09

Davis Herring
