I have a pandas dataframe of pairwise distances in the form of:
SampleA SampleB Num_Differences
0 sample_1 sample_2 1
1 sample_1 sample_3 4
2 sample_2 sample_3 8
Note that there are no self-self comparisons (e.g., sample_1 vs sample_1 won't be represented). I would like to convert this table into a squareform distance matrix instead, like so:
sample_1 sample_2 sample_3
sample_1 1 4
sample_2 1 8
sample_3 4 8
Can anyone give me some pointers on how to do such a conversion in python? The problem is analogous to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding python functions to use. The problem also appears to be the opposite of this question (Convert a distance matrix to a list of pairwise distances in Python).
Some code to reproduce a dataframe in the form I'm using:
df = pd.DataFrame([['sample_1', 'sample_2', 1],
['sample_1', 'sample_3', 4],
['sample_2', 'sample_3', 8]],
columns=['SampleA', 'SampleB', 'Num_Differences'])
You can reshape to square, and then make symmetrical by adding the transposed values:
# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))
# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
.reindex(index=idx, columns=idx)
.fillna(0, downcast='infer')
.pipe(lambda x: x+x.values.T)
)
Alternatively, you can use ordered categorical indexes and keep NAs during reshaping with pivot_table
. Then add the transposed values to make symmetrical:
cat = sorted(set(df['SampleA']).union(df['SampleB']))
(df.assign(SampleA=pd.Categorical(df['SampleA'],
categories=cat,
ordered=True),
SampleB=pd.Categorical(df['SampleB'],
categories=cat,
ordered=True),
)
.pivot_table(index='SampleA',
columns='SampleB',
values='Num_Differences',
dropna=False, fill_value=0)
.pipe(lambda x: x+x.values.T)
)
Output:
SampleB sample_1 sample_2 sample_3
SampleA
sample_1 0 1 4
sample_2 1 0 8
sample_3 4 8 0
We seem to be converting a weighted edgelist to an adjacency matrix. We can use networkx
functions to make this conversion from_pandas_edgelist
to adjacency_matrix
:
import networkx as nx
import pandas as pd
# Create Graph
G = nx.from_pandas_edgelist(
df,
source='SampleA',
target='SampleB',
edge_attr='Num_Differences'
)
# Build adjacency matrix
adjacency_df = pd.DataFrame(
nx.adjacency_matrix(G, weight='Num_Differences').todense(),
index=G.nodes,
columns=G.nodes
)
adjacency_df
:
sample_1 sample_2 sample_3
sample_1 0 1 4
sample_2 1 0 8
sample_3 4 8 0
We can also fill the diagonal with numpy.fill_diagonal
if wanting NaN instead of 0s:
import networkx as nx
import numpy as np
import pandas as pd
G = nx.from_pandas_edgelist(
df,
source='SampleA',
target='SampleB',
edge_attr='Num_Differences'
)
adjacency_df = pd.DataFrame(
nx.adjacency_matrix(G, weight='Num_Differences').todense(),
index=G.nodes,
columns=G.nodes,
dtype=float # Compatible dtype with NaN is needed
)
# Overwrite the values on the diagonal
np.fill_diagonal(adjacency_df.values, np.NaN)
adjacency_df
:
sample_1 sample_2 sample_3
sample_1 NaN 1.0 4.0
sample_2 1.0 NaN 8.0
sample_3 4.0 8.0 NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With