Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to turn dictionary into symmetric/distance matrix in Pandas

I'm doing pairwise distance for something with a weird distance metric. I have a dictionary like {(key_A, key_B):distance_value} and I want to make a symmetric pd.DataFrame like a distance matrix.

What is the most efficient way to do this? I found one way but it doesn't seem like the best way to do this. Is there anything in NumPy or Pandas that does this type of operation? or just a quicker way? My way is 1.46 ms per loop

np.random.seed(0)
D_pair_value = dict()
for pair in itertools.combinations(list("ABCD"),2):
    D_pair_value[pair] = np.random.randint(0,5)
D_pair_value
# {('A', 'B'): 4,
#  ('A', 'C'): 0,
#  ('A', 'D'): 3,
#  ('B', 'C'): 3,
#  ('B', 'D'): 3,
#  ('C', 'D'): 1}
D_nested_dict = defaultdict(dict)
for (p,q), value in D_pair_value.items():
    D_nested_dict[p][q] = value
    D_nested_dict[q][p] = value

# Fill diagonal with zeros
DF = pd.DataFrame(D_nested_dict)
np.fill_diagonal(DF.values, 0)
DF

enter image description here

like image 975
O.rka Avatar asked Aug 17 '16 18:08

O.rka


1 Answers

You can use scipy.spatial.distance.squareform, which converts a vector of distance computations, i.e. [d(A,B), d(A,C), ..., d(C,D)], into the distance matrix you're looking for.

Method 1: Distances Stored in a List

If you're computing your distances in order, like in your example code and in my example distance vector, I'd avoid using a dictionary and just store the results in a list, and do something like:

from scipy.spatial.distance import squareform

df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))

Method 2: Distances Stored in a Dictionary

If you're computing things out of order and a dictionary is required, you just need to get a distance vector that's properly sorted:

from scipy.spatial.distance import squareform

dist_list = [dist[1] for dist in sorted(D_pair_value.items())]
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))

Method 3: Distances Stored in a Sorted Dictionary

If a dictionary is required, note that there's a package called sortedcontainers which has a SortedDict that essentially would solve the sorting issue for you. To use it, all you'd need to change is initializing D_pair_value as a SortedDict() instead of a dict. Using your example setup:

from scipy.spatial.distance import squareform
from sortedcontainers import SortedDict

np.random.seed(0)
D_pair_value = SortedDict()
for pair in itertools.combinations(list("ABCD"),2):
    D_pair_value[pair] = np.random.randint(0,5)

df = pd.DataFrame(squareform(D_pair_value.values()), index=list('ABCD'), columns=list('ABCD'))

The Resulting Output for Any Method Above:

     A    B    C    D
A  0.0  4.0  0.0  3.0
B  4.0  0.0  3.0  3.0
C  0.0  3.0  0.0  1.0
D  3.0  3.0  1.0  0.0
like image 125
root Avatar answered Nov 14 '22 17:11

root