Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare and count the sparse arrays in a list in python

Holla!

I have a list of 60 large-size 2d arrays (30000,30000).

The goal is to compare each array with every other array and count the total number of exactly the same arrays in the entire list.

I am working on this logic, however, it is counting the number of same arrays individually and not what I want:

import numpy as np
import pandas as pd
import scipy.sparse as sp

## I am using this dummy setup, to begin with (rather than the large data)

# creating 4 dummy arrays
a = np.zeros((6,6))
a[1,2] = 1
a[2,5] = 1
a[3,2] = 1
a[4,1] = 1
print(a)
b = np.zeros((6,6))
b[1,2] = 1
b[2,5] = 1
b[3,2] = 1
b[4,1] = 1

c = np.zeros((6,6))
c[1,3] = 1
c[2,5] = 1
d = np.zeros((6,6))
d[1,3] = 1
d[2,4] = 1

# storing the arrays in a list
list2d = [a,b,c,d]

#loop through the list to count the number of arrays with exactly same values
n = len(list2d)
for i in range(n):
    count = 0
    for j in range(n):
        if (list2d[i] == list2d[j]).all() and i != j:
            count += 1
            print('list2d[',i,'] is the same as list2d[',j,']')
        else:
            print('list2d[',i,'] is not the same as list2d[',j,']')

print('total number of same arrays || count = ',count)

Another option is working with sparse matrices and storing them in a list. However, I'm not sure whether we can compare or check for equity on the entire list with 60 sparse arrays.

# again finalizing a logic on a dummy setup
a_sparse = sp.csr_matrix(a)
b_sparse = sp.csr_matrix(b)
c_sparse = sp.csr_matrix(c)
d_sparse = sp.csr_matrix(d)
print(a_sparse)
# #list of sparse matrices
list_sparse = [a_sparse,b_sparse,c_sparse,d_sparse]
## compare the list of sparse arrays and count the total number of exactly same arrays
## also, print/ store all the equal arrays 

Any suggestions and/or feedback for getting the correct logic is appreciated. Cheers!

like image 533
Y. Pat Avatar asked Mar 03 '26 17:03

Y. Pat


2 Answers

I would probably not choose to fiddle with array stuff in weird ways. I definitely would not cast sparse matrices to dense for this, as doing that directly for 60 of these 30k x 30k things would require 450GB of memory or so. Just check everything as it is.

Set up the problem (so that there are 40 unique matrices and 20 matrices which are not unique), and use Counters instead of reinventing that wheel:

from collections import Counter
from scipy import sparse
import numpy as np

list_of_arrays = [sparse.rand(200,200,density=np.random.uniform(0.025, 0.075),format='csr') for _ in range(50)]

for i in range(10):
    list_of_arrays.append(list_of_arrays[i])

Exclude any matrices which have unique shapes or unique nnz (as they're trivial to check):

# Check NNZ
nnz_counter = Counter([x.nnz for x in list_of_arrays])
non_unique_arrays = [x for x in list_of_arrays if nnz_counter[x.nnz] > 1]

# Check Shape
shape_counter = Counter([x.shape for x in non_unique_arrays])
non_unique_arrays = [x for x in non_unique_arrays if shape_counter[x.shape] > 1]

Use numpy array views + hashing to compare arrays to find identical arrays (this returns a list of True if the array has a duplicate and False otherwise).

# Check a list of arrays for duplicates by hashing
def array_hash(arrays):
    return [hash(x.view) for x in arrays]

def array_hash_duplicates(arrays):
    hashes = array_hash(arrays)
    hash_counter = Counter(hashes)
    return [True if hash_counter[x] > 1 else False for x in hashes]

Now apply that check to the matrix indptr, indices, and data arrays in order, removing any matrices which are unique after each check.

# Check indptr, indices, and data in order
non_unique_arrays = [
    x
    for x, y in zip(
        non_unique_arrays,
        array_hash_duplicates([x.indptr for x in non_unique_arrays])
    )
    if y
]

non_unique_arrays = [
    x
    for x, y in zip(
        non_unique_arrays,
        array_hash_duplicates([x.indices for x in non_unique_arrays])
    )
    if y
]

duplicates = Counter(array_hash([x.data for x in non_unique_arrays]))
n_duplicates = sum(x - 1 for x in duplicates.values())

>>> n_duplicates
10

This results in a list of matrices which are non-unique (so at least one other matrix is identical in the list). It's possible to have multiple non-unique matrices which are not the same, of course.

Note that this is inefficient if you expect the list to be duplicates of the same python object, not just different matrices with the same values. That would be easy to solve another way.

like image 145
CJR Avatar answered Mar 06 '26 06:03

CJR


EDIT#3:

Based on your comments, I think this is what you are trying to do.

import numpy as np
from copy import deepcopy

def convert_to_tuple(mat):
    x = tuple(np.flatnonzero(mat)) + mat.shape
    return (x)

def get_replicates(id, mat, mat_list):
    replicates = 0
    
    #Remove the relevant matrix from mat_list to avoid checking the reference against itself
    del mat_list[id]
    
    # Create a tuple of the reference matrix
    ref = convert_to_tuple(mat)
    print(id, ":",  ref)
    
    # Check how many replicates of the reference matrix there are
    for m in mat_list:
        s = set()
        s.add(convert_to_tuple(m))
        s.add(ref)
        replicates += (-len(s)+2)
    
    # Replace the matrix into mat_list
    mat_list.insert(id, mat)
    
    return replicates    
    

# Generate a number of sparse matrices
# a=b
# c=d=e=f=g
# ----------------------------------------------------------------------------
a = np.zeros((6,6))
a[1,2] = 1
a[2,5] = 1
a[3,2] = 1
a[4,1] = 1

b = deepcopy(a)

c = np.zeros((6,6))
c[1,3] = 1
c[2,5] = 1

d = deepcopy(c)
e = deepcopy(c)
f = deepcopy(c)
g = deepcopy(c)


# storing the arrays in a list
list2d = [a,b,c,d,e,f,g]


# Identify the number of replicates
#------------------------------------------------------------------------------
number_of_replicates  = [get_replicates(i, arr, list2d) for i, arr in enumerate(list2d)]
    

# Print the number of replicates 
#------------------------------------------------------------------------------
for i, reps in enumerate(number_of_replicates):
    print(f"Sparse Array {i} has {reps} replicates")

OUTPUT:

0 : (8, 17, 20, 25, 6, 6)
1 : (8, 17, 20, 25, 6, 6)
2 : (9, 17, 6, 6)
3 : (9, 17, 6, 6)
4 : (9, 17, 6, 6)
5 : (9, 17, 6, 6)
6 : (9, 17, 6, 6)

Sparse Array 0 has 1 replicates
Sparse Array 1 has 1 replicates
Sparse Array 2 has 4 replicates
Sparse Array 3 has 4 replicates
Sparse Array 4 has 4 replicates
Sparse Array 5 has 4 replicates
Sparse Array 6 has 4 replicates

The top part of the output shows the what each matrix looks like after being converted to a tuple. The tuple contains the index of each 1 within the matrix, and the shape of each matrix 6,6 is appended to the end.

From the output you can see that:

array a and b - have 1 replicate each
arrays c,d,e,f,g - have 4 replicates each
like image 29
ScottC Avatar answered Mar 06 '26 06:03

ScottC