Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Image clustering by its similarity in python

I have a collection of photos and I'd like to distinguish clusters of the similar photos. Which features of an image and which algorithm should I use to solve my task?

like image 491
alex Avatar asked Aug 24 '16 12:08

alex


People also ask

How do you cluster images in python?

Generally speaking you can use any clustering mechanism, e.g. a popular k-means. To prepare your data for clustering you need to convert your collection into an array X, where every row is one example (image) and every column is a feature. The main question - what your features should be.

What is image similarity?

Image similarity is the measure of how similar two images are. In other words, it quantifies the degree of similarity between intensity patterns in two images.

Can we cluster images?

Clustering of images is a multi-step process for which the steps are to pre-process the images, extract the features, cluster the images on similarity, and evaluate for the optimal number of clusters using a measure of goodness.


2 Answers

I had the same problem and I came up with this solution:

  1. Import a pretrained model using Keras (here VGG16)
  2. Extract features per image
  3. Do kmeans
  4. Export by copying with cluster label

Here is my code, partly motivated by this post.

from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np
from sklearn.cluster import KMeans
import os, shutil, glob, os.path
from PIL import Image as pil_image
image.LOAD_TRUNCATED_IMAGES = True 
model = VGG16(weights='imagenet', include_top=False)

# Variables
imdir = 'C:/indir/'
targetdir = "C:/outdir/"
number_clusters = 3

# Loop over files and get features
filelist = glob.glob(os.path.join(imdir, '*.jpg'))
filelist.sort()
featurelist = []
for i, imagepath in enumerate(filelist):
    print("    Status: %s / %s" %(i, len(filelist)), end="\r")
    img = image.load_img(imagepath, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)
    features = np.array(model.predict(img_data))
    featurelist.append(features.flatten())

# Clustering
kmeans = KMeans(n_clusters=number_clusters, random_state=0).fit(np.array(featurelist))

# Copy images renamed by cluster 
# Check if target dir exists
try:
    os.makedirs(targetdir)
except OSError:
    pass
# Copy with cluster name
print("\n")
for i, m in enumerate(kmeans.labels_):
    print("    Copy: %s / %s" %(i, len(kmeans.labels_)), end="\r")
    shutil.copy(filelist[i], targetdir + str(m) + "_" + str(i) + ".jpg")

Update 02/2022:

In some cases (e.g. unknown number of clusters) using Affinity Propagation may be a much better choice than kmeans. In this case, replace kmeans by:

from sklearn.cluster import AffinityPropagation
affprop = AffinityPropagation(affinity="euclidean", damping=0.5).fit(np.array(featurelist))

and loop over affprop.labels_ to access the results.

like image 131
Peter Avatar answered Nov 15 '22 17:11

Peter


It is a too broad question.

Generally speaking you can use any clustering mechanism, e.g. a popular k-means. To prepare your data for clustering you need to convert your collection into an array X, where every row is one example (image) and every column is a feature.

The main question - what your features should be. It is difficult to answer without knowing what you are trying to accomplish. If your images are small and of the same size you can simply have every pixel as a feature. If you have any metadata and would like to sort using it - you can have every tag in metadata as a feature.

Now if you really need to find some patterns between images you will have to apply an additional layer of processing, like convolutional neural network, which essentially allows you to extract features from different pieces of your image. You can think about it as a filter, which will convert every image into, say 8x8 matrix, which then correspondingly could be used as a row with 64 different features in your array X for clustering.

like image 38
omdv Avatar answered Nov 15 '22 16:11

omdv