How to remove duplicate items when training a CNN?

Question

I'm working on an image classification problem using a CNN. My image dataset contains duplicated images, and when I train the CNN on this data, it overfits. Therefore, I need to remove those duplicates.

nav · Accepted Answer

What we loosely refer to as duplicates can be difficult for algorithms to discern. Your duplicates can be any of:

1. Exact duplicates
2. Near-exact duplicates (minor edits of the image, etc.)
3. Perceptual duplicates (same content, but a different view, camera, etc.)

Nos. 1 and 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer a solution for Nos. 1 and 2. Both solutions use the excellent image hashing library imagehash: https://github.com/JohannesBuchner/imagehash

Exact duplicates

Exact duplicates can be found using a perceptual hash, since identical images always produce identical hash values. The imagehash library is quite good at this, and I routinely use it to clean training data. Usage (adapted from the GitHub page) is as simple as:

from PIL import Image
import imagehash

# image_fns: list of training image file paths
img_hashes = {}  # maps each hash to the first file seen with that hash

for img_fn in sorted(image_fns):
    img_hash = imagehash.average_hash(Image.open(img_fn))
    if img_hash in img_hashes:
        print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = img_fn
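
If the duplicates are byte-for-byte copies of the same file, a cryptographic digest of the file contents can serve as a cheap first pass before any image decoding. This is a different technique from the perceptual hash above, not part of imagehash, and it will miss copies that were re-encoded or resized. The helper names `file_digest` and `find_identical_files` are made up for this sketch:

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(paths):
    """Map each duplicate path to the first path seen with the same bytes."""
    first_seen = {}   # digest -> first file with that digest
    duplicates = {}   # duplicate file -> file it duplicates
    for path in sorted(str(p) for p in paths):
        key = file_digest(path)
        if key in first_seen:
            duplicates[path] = first_seen[key]
        else:
            first_seen[key] = path
    return duplicates
```

Calling `find_identical_files(image_fns)` returns a dict whose keys are the files that could be dropped; the perceptual hash can then be run on what remains.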

Near-exact duplicates

In this case you have to set a threshold and compare the hash values by their distance from each other. The right threshold has to be found by trial and error for your image content.

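from itertools import combinations

from PIL import Image
import imagehash

# image_fns: list of training image file paths
# Max number of differing bits (0-64 for the default 64-bit hash).
# A small value is only a starting point; tune it for your images.
epsilon = 5

# Hash every image once instead of re-opening the files for each pair
img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}

# Compare every unordered pair of distinct files
for img_fn1, img_fn2 in combinations(sorted(img_hashes), 2):
    # Subtracting two hashes gives the number of bits in which they differ
    if img_hashes[img_fn1] - img_hashes[img_fn2] < epsilon:
        print('{} is near duplicate of {}'.format(img_fn1, img_fn2))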
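
For intuition about the threshold: subtracting two imagehash values gives their Hamming distance, i.e. the number of bit positions in which they differ. The sketch below reproduces that in plain Python on hex strings (the form that `str()` of a hash prints) and then groups files linked by near-duplicate pairs, so that one file per group can be kept. The function names and the sample hash strings are invented for illustration:

```python
from itertools import combinations


def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def group_near_duplicates(hashes, epsilon):
    """Group file names whose hashes are linked by distances below epsilon.

    hashes: dict mapping file name -> hex hash string.
    Returns a sorted list of groups; a group of size 1 has no near duplicate.
    """
    parent = {name: name for name in hashes}

    def find(name):
        # Follow parent links up to the group's representative
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sorted(hashes), 2):
        if hamming_distance(hashes[a], hashes[b]) < epsilon:
            parent[find(b)] = find(a)

    groups = {}
    for name in sorted(hashes):
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Made-up 64-bit hashes: b differs from a in 1 bit, c differs from a in 32 bits
example = {
    "a.jpg": "ffff0000ffff0000",
    "b.jpg": "ffff0000ffff0001",
    "c.jpg": "0000ffffffff0000",
}
groups = group_near_duplicates(example, epsilon=5)
```

Because groups are built transitively, two files in the same group can be farther apart than `epsilon` if a chain of closer images connects them; whether that is acceptable depends on the data.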

Vinson Ciawandy · Answer

The solution from @nav is quite good for finding exact and near-exact duplicates.
Since your use case is training a neural network, and similar images are causing your model to overfit, it might be wiser to remove any kind of similarity.

I found this project for image deduplication: https://github.com/idealo/imagededup

With the CNN method in that project, you can also remove perceptual duplicates (which removes near-exact and exact duplicates as well).
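
As a rough picture of how CNN-based deduplication works in general: each image is turned into a feature vector by a pretrained network, and pairs whose vectors have a high cosine similarity are treated as duplicates. The sketch below shows only that comparison step, assuming the embeddings were computed elsewhere. The vectors, file names, and `similar_pairs` helper are invented for illustration and are not imagededup's API:

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(embeddings, min_similarity):
    """Return file-name pairs whose embeddings reach the similarity threshold."""
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        if cosine_similarity(embeddings[a], embeddings[b]) >= min_similarity:
            pairs.append((a, b))
    return pairs


# Toy 3-dimensional "embeddings"; real CNN features have far more dimensions
embeddings = {
    "cat_front.jpg": [0.9, 0.1, 0.0],
    "cat_side.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}
pairs = similar_pairs(embeddings, min_similarity=0.9)
```

As with the hash distance, the similarity threshold is a judgment call: a lower value removes more perceptual duplicates but risks discarding images that are merely alike.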

Tags: python, image-processing, deep-learning, keras, conv-neural-network

Yidne
