Manipulating a large dataframe most efficiently

Question

Imagine I have this dataframe called temp:

import numpy as np
import pandas as pd
from numpy.random import default_rng

temp = pd.DataFrame(index=range(10), columns=list('abcd'))
for row in temp.index:
    temp.loc[row] = default_rng().choice(10, size=4, replace=False)

temp.loc[1, 'b'] = np.nan
temp.loc[3, 'd'] = np.nan

The values are of the same nature as the index labels: every non-missing value in temp is itself a row label. My goal is to create an adjacency matrix whose index and columns are both temp.index, with a 1 wherever a label appears as a value in that row of temp.

What I have done:

temp2 = pd.DataFrame(index = temp.index, columns = temp.index)
for index in temp.index:  
    temp2.loc[index, temp.loc[index].dropna().values] = 1

temp2 = temp2.replace(np.nan, 0)

This does the job. For example, temp2 shows that row index 0 is adjacent to indices 4, 5, 7, and 8. In other words, the labels that appear in row 0 of temp have a value of 1 in row 0 of temp2, and all others have a value of 0.

Problem: the real temp has 132K rows, and creating temp2 raises a memory error. What is the most efficient way to get to temp2? FWIW, the index is range(132000). Also, I'm going to later convert this matrix to a Torch tensor of shape (2, number of edges) that carries the same adjacency info:

import torch

adj = torch.tensor(temp2.values)
edge_index = adj.nonzero().t().contiguous()
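
A rough size estimate suggests why the dense temp2 cannot fit in memory while the edge list can. The sketch below assumes 8 bytes per cell and at most 4 values per row, as in the toy temp. The width of the real frame is not stated in the question, so max_values_per_row is an assumption.

```python
# Back-of-the-envelope memory estimate (pure arithmetic, nothing is allocated).
n_nodes = 132_000          # number of row labels, from the question
bytes_per_cell = 8         # assumption: float64 / int64 / object-pointer storage
max_values_per_row = 4     # assumption: same width as the toy `temp`

dense_bytes = n_nodes * n_nodes * bytes_per_cell
max_edges = n_nodes * max_values_per_row
edge_index_bytes = 2 * max_edges * bytes_per_cell  # shape (2, number_of_edges)

print(f"dense matrix: {dense_bytes / 1e9:.1f} GB")
print(f"edge index:   {edge_index_bytes / 1e6:.1f} MB")
```

Under these assumptions the square matrix needs on the order of 139 GB, whereas the (2, number of edges) representation needs only a few megabytes.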

mozway · Accepted Answer

First of all, the pandas approach to create the output would be a crosstab:

s = temp.stack()
out = (pd.crosstab(s.index.get_level_values(0), s.values)
         .rename_axis(index=None, columns=None)
      )

Output:

   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  1  1  0  1  1  0
1  0  0  0  0  1  0  0  0  1  1
2  0  1  0  0  1  0  0  1  0  1
3  1  1  0  0  0  0  0  1  0  0
4  0  0  1  0  1  0  0  1  0  1
5  0  0  1  0  0  0  1  0  1  1
6  0  1  0  1  1  0  0  1  0  0
7  1  0  0  1  0  1  1  0  0  0
8  1  1  0  0  0  0  0  1  1  0
9  0  1  0  0  0  1  1  0  0  1

However, if your goal is to create a tensor of shape (2, number_of_edges), why create an intermediate square DataFrame?

Directly create the desired tensor:

import torch

idx = s.index.get_level_values(0)
coord = torch.tensor([idx, s.values], dtype=torch.int32)

Output coord:

tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
         6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
        [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8, 9, 2, 6, 1, 3,
         7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]], dtype=torch.int32)
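
As an alternative sketch (not part of the answer above), the same (2, number_of_edges) layout can probably be built with NumPy alone, skipping the stacked Series. The small values array below is a hypothetical stand-in for the frame's data; with the real frame, something like temp.to_numpy(dtype=float) would likely supply it.

```python
import numpy as np

# Toy stand-in for the frame's values: 3 rows, NaN marks a missing value.
values = np.array([[1.0, 2.0],
                   [0.0, np.nan],
                   [2.0, 0.0]])

n_rows, n_cols = values.shape
rows = np.repeat(np.arange(n_rows), n_cols)   # row label for every cell
cols = values.ravel()                         # cell values in row-major order
keep = ~np.isnan(cols)                        # mask out missing cells

# First row: source labels; second row: the labels found in that source row.
edge_index = np.stack([rows[keep], cols[keep].astype(np.int64)])
print(edge_index)
```

The resulting integer array could then be handed to torch.from_numpy to obtain the tensor without ever allocating a square matrix.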

And if you want, you can create a sparse square tensor with sparse_coo_tensor:

out = torch.sparse_coo_tensor(coord, torch.ones(len(s)))

NB. if the same value can occur more than once within a row of temp, the coordinates will contain duplicates, and you additionally need to coalesce the sparse tensor.

Output:

tensor(indices=tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
                        5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
                       [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8,
                        9, 2, 6, 1, 3, 7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]]),
       values=tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
       size=(10, 10), nnz=38, layout=torch.sparse_coo)
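
To illustrate the note on duplicate coordinates, here is a minimal sketch with made-up coordinates in which the pair (0, 2) appears twice. The toy coord tensor and the binary variable are illustrative assumptions, not part of the original data.

```python
import torch

# Hypothetical coordinates in which the pair (0, 2) occurs twice.
coord = torch.tensor([[0, 0, 1],
                      [2, 2, 0]])
sparse = torch.sparse_coo_tensor(coord, torch.ones(coord.shape[1]), size=(2, 3))

merged = sparse.coalesce()   # duplicates are merged and their values summed
print(merged.indices())      # unique coordinates only
print(merged.values())       # the entry at (0, 2) now holds 2.0

# Reset the values to 1 if a binary adjacency is wanted.
binary = torch.sparse_coo_tensor(merged.indices(),
                                 torch.ones_like(merged.values()),
                                 size=merged.shape)
```

Because coalescing sums the values of repeated coordinates, the last step is only needed when the tensor should stay a 0/1 adjacency rather than a count of occurrences.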

Tags: python, pandas, dataframe, numpy, pytorch

Saeed · 1 Answer · mozway
