Manipulating a large dataframe most efficiently

Imagine I have this dataframe called temp:

import numpy as np
import pandas as pd
from numpy.random import default_rng

temp = pd.DataFrame(index=range(10), columns=list('abcd'))
for row in temp.index:
    temp.loc[row] = default_rng().choice(10, size=4, replace=False)

temp.loc[1, 'b'] = np.nan
temp.loc[3, 'd'] = np.nan


The values are of the same nature as the indices: every value in temp is itself a row label. My goal is to create an adjacency matrix whose index and columns are both temp.index, and which shows which values appear in each index's row.

What I have done:

temp2 = pd.DataFrame(index = temp.index, columns = temp.index)
for index in temp.index:  
    temp2.loc[index, temp.loc[index].dropna().values] = 1

temp2 = temp2.replace(np.nan, 0)


This does the job: for example, temp2 shows that row index 0 is adjacent to indices 4, 5, 7, and 8. In other words, indices that appear in row 0 of temp have a value of 1 in temp2, and all others have a value of 0.

Problem: the real temp has 132K indices, and creating temp2 raises a memory error. What is the most efficient way of getting to temp2? FWIW, the indices are range(132000). Also, I'm going to convert this matrix later to a Torch tensor of shape (2, number of edges) that carries the same adjacency info:

adj = torch.tensor(temp2.values)
edge_index = adj.nonzero().t().contiguous()
Saeed asked May 24 '26 15:05



1 Answer

First of all, the pandas approach to create the output would be a crosstab:

s = temp.stack()
out = (pd.crosstab(s.index.get_level_values(0), s.values)
         .rename_axis(index=None, columns=None)
      )

Output:

   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  1  1  0  1  1  0
1  0  0  0  0  1  0  0  0  1  1
2  0  1  0  0  1  0  0  1  0  1
3  1  1  0  0  0  0  0  1  0  0
4  0  0  1  0  1  0  0  1  0  1
5  0  0  1  0  0  0  1  0  1  1
6  0  1  0  1  1  0  0  1  0  0
7  1  0  0  1  0  1  1  0  0  0
8  1  1  0  0  0  0  0  1  1  0
9  0  1  0  0  0  1  1  0  0  1

However, if your goal is to create a tensor of shape (2, number_of_edges), why create an intermediate square DataFrame?
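To give a sense of scale, here is a rough back-of-the-envelope estimate. It assumes 8-byte cells (int64 or float64) and, as in the toy example, 4 columns in the real temp; the actual column count is not stated in the question, so treat the second figure as illustrative only.

```python
# Rough memory estimate: dense adjacency matrix vs. edge list.
n_nodes = 132_000        # number of row labels, from the question
n_cols = 4               # assumption: same width as the toy frame
bytes_per_cell = 8       # assumption: int64 / float64 storage

# A dense square matrix stores one cell per (row, column) pair.
dense_bytes = n_nodes * n_nodes * bytes_per_cell

# An edge list stores two integers per edge, at most n_cols edges per row.
max_edges = n_nodes * n_cols
edge_bytes = 2 * max_edges * bytes_per_cell

print(f"dense matrix: {dense_bytes / 1e9:.1f} GB")
print(f"edge list:    {edge_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense matrix needs on the order of 139 GB, while the edge list stays under 10 MB, which is why skipping the square intermediate matters.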

Directly create the desired tensor:

import torch

idx = s.index.get_level_values(0)
coord = torch.tensor([idx, s.values], dtype=torch.int32)

Output coord:

tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
         6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
        [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8, 9, 2, 6, 1, 3,
         7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]], dtype=torch.int32)
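The same coordinates can also be built with plain NumPy, without going through stack. This is only a sketch, assuming the row labels and the values are integers and that NaN marks a missing entry; the helper name edge_index_from_frame and the small demo frame are hypothetical and not part of the question.

```python
import numpy as np
import pandas as pd


def edge_index_from_frame(frame: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: return a (2, n_edges) int64 array of (row label, value) pairs."""
    # Cast to float so that missing entries are represented as NaN.
    values = frame.to_numpy(dtype="float64")
    mask = ~np.isnan(values)

    # Repeat each row label once per column so it lines up with `values`.
    rows = np.repeat(frame.index.to_numpy(), frame.shape[1]).reshape(values.shape)

    src = rows[mask].astype(np.int64)    # source node of each edge
    dst = values[mask].astype(np.int64)  # target node of each edge
    return np.stack([src, dst])


# Small demo frame (made up for illustration): row 1 has one missing value.
demo = pd.DataFrame({"a": [1, 2, 0], "b": [2, np.nan, 1]})
edge_index = edge_index_from_frame(demo)
print(edge_index)
```

The result can be handed to PyTorch with torch.from_numpy(edge_index), which keeps the int64 dtype that nonzero() would also have produced.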

And if you want, you can create a sparse square tensor with sparse_coo_tensor:

out = torch.sparse_coo_tensor(coord, torch.ones(len(s)))

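NB: if a row of temp contains the same value more than once, the sparse tensor will hold duplicate coordinates; in that case, call out.coalesce() to merge them (the values of duplicates are summed).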

Output:

tensor(indices=tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
                        5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
                       [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8,
                        9, 2, 6, 1, 3, 7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]]),
       values=tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
       size=(10, 10), nnz=38, layout=torch.sparse_coo)
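The note on duplicate coordinates can be illustrated with a tiny made-up example (the indices below are not taken from temp). It is a sketch of how coalesce merges repeated coordinates by summing their values, followed by one possible way to reset the values to 1 if a strictly binary adjacency is wanted.

```python
import torch

# Coordinate (0, 2) appears twice, as it would if a row repeated a value.
indices = torch.tensor([[0, 0, 1],
                        [2, 2, 0]])
values = torch.ones(3)

sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

# coalesce() merges duplicate coordinates and sums their values.
merged = sparse.coalesce()
print(merged.indices())  # unique coordinates
print(merged.values())   # (0, 2) now carries the value 2

# Optional: rebuild with all-ones values for a binary adjacency matrix.
binary = torch.sparse_coo_tensor(
    merged.indices(), torch.ones_like(merged.values()), size=(3, 3)
)
```

If only the (2, number_of_edges) tensor is needed, merged.indices() already holds the de-duplicated edge list.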
mozway answered May 27 '26 03:05



