I'm doing pairwise distance for something with a weird distance metric. I have a dictionary like {(key_A, key_B):distance_value}
and I want to make a symmetric pd.DataFrame
like a distance matrix.
What is the most efficient way to do this? I found one way but it doesn't seem like the best way to do this. Is there anything in NumPy
or Pandas
that does this type of operation? or just a quicker way? My way is 1.46 ms per loop
np.random.seed(0)
D_pair_value = dict()
for pair in itertools.combinations(list("ABCD"),2):
D_pair_value[pair] = np.random.randint(0,5)
D_pair_value
# {('A', 'B'): 4,
# ('A', 'C'): 0,
# ('A', 'D'): 3,
# ('B', 'C'): 3,
# ('B', 'D'): 3,
# ('C', 'D'): 1}
D_nested_dict = defaultdict(dict)
for (p,q), value in D_pair_value.items():
D_nested_dict[p][q] = value
D_nested_dict[q][p] = value
# Fill diagonal with zeros
DF = pd.DataFrame(D_nested_dict)
np.fill_diagonal(DF.values, 0)
DF
You can use scipy.spatial.distance.squareform
, which converts a vector of distance computations, i.e. [d(A,B), d(A,C), ..., d(C,D)]
, into the distance matrix you're looking for.
Method 1: Distances Stored in a List
If you're computing your distances in order, like in your example code and in my example distance vector, I'd avoid using a dictionary and just store the results in a list, and do something like:
from scipy.spatial.distance import squareform
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))
Method 2: Distances Stored in a Dictionary
If you're computing things out of order and a dictionary is required, you just need to get a distance vector that's properly sorted:
from scipy.spatial.distance import squareform
dist_list = [dist[1] for dist in sorted(D_pair_value.items())]
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))
Method 3: Distances Stored in a Sorted Dictionary
If a dictionary is required, note that there's a package called sortedcontainers
which has a SortedDict
that essentially would solve the sorting issue for you. To use it, all you'd need to change is initializing D_pair_value
as a SortedDict()
instead of a dict
. Using your example setup:
from scipy.spatial.distance import squareform
from sortedcontainers import SortedDict
np.random.seed(0)
D_pair_value = SortedDict()
for pair in itertools.combinations(list("ABCD"),2):
D_pair_value[pair] = np.random.randint(0,5)
df = pd.DataFrame(squareform(D_pair_value.values()), index=list('ABCD'), columns=list('ABCD'))
The Resulting Output for Any Method Above:
A B C D
A 0.0 4.0 0.0 3.0
B 4.0 0.0 3.0 3.0
C 0.0 3.0 0.0 1.0
D 3.0 3.0 1.0 0.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With