
Pandas MultiIndex lookup with Numpy arrays

I'm working with a pandas DataFrame that represents a graph. The dataframe is indexed by a MultiIndex that indicates the node endpoints.

Setup:

import pandas as pd
import numpy as np
import itertools as it
edges = list(it.combinations([1, 2, 3, 4], 2))

# Define a dataframe to represent a graph
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame.from_dict({
    'edge_id': list(range(len(edges))),
    'edge_weight': np.random.RandomState(0).rand(len(edges)),
})
df.index = index
print(df)
## -- End pasted text --
     edge_id  edge_weight
u v                      
1 2        0       0.5488
  3        1       0.7152
  4        2       0.6028
2 3        3       0.5449
  4        4       0.4237
3 4        5       0.6459

I want to be able to index into the graph using an edge subset, which is why I've chosen to use a MultiIndex. I'm able to do this just fine as long as the input to df.loc is a list of tuples.

# Select subset of graph using list-of-tuple indexing
edge_subset1 = [edges[x] for x in [0, 3, 2]]
df.loc[edge_subset1]
## -- End pasted text --
     edge_id  edge_weight
u v                      
1 2        0       0.5488
2 3        3       0.5449
1 4        2       0.6028

However, when my list of edges is a numpy array (as it often is), or a list of lists, then I seem to be unable to use the df.loc property.

# Why can't I do this if `edge_subset2` is a numpy array?
edge_subset2 = np.array(edge_subset1)
df.loc[edge_subset2]
## -- End pasted text --
TypeError: unhashable type: 'numpy.ndarray'

It would be ok if I could just call arr.tolist(), but this results in a seemingly different error.

# Why can't I do this if `edge_subset2` is a numpy array?
# or if `edge_subset3` is a list-of-lists?
edge_subset3 = edge_subset2.tolist()
df.loc[edge_subset3]
## -- End pasted text --
TypeError: '[1, 2]' is an invalid key

It's a real pain to have to use list(map(tuple, arr.tolist())) every time I want to select a subset. It would be nice if there was another way to do this.

The main questions are:

  • Why can't I use a numpy array with .loc? Is it because under the hood a dictionary is being used to map the multi-index labels to positional indices?

  • Why does a list-of-lists give a different error? Maybe it's really the same problem, just caught a different way?

  • Is there another (ideally less verbose) way to look up a subset of a dataframe with a numpy array of multi-index labels that I'm unaware of?

Erotemic asked Jan 05 '17 19:01



1 Answer

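Dictionary keys must be hashable, which in practice means immutable. Lists and numpy arrays are mutable and unhashable, and that's basically why you can't use a list of lists (or a 2D array) to access a multi-index.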
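As a quick check of the hashability point above, here is a minimal sketch. The helper name `is_hashable` is made up for illustration and is not part of pandas or numpy; it only tests whether Python's built-in `hash` accepts an object.

```python
import numpy as np


def is_hashable(obj):
    """Hypothetical helper: True if `obj` could serve as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# The same edge label written three ways
tuple_key = (1, 2)
list_key = [1, 2]
array_key = np.array([1, 2])

results = {
    'tuple': is_hashable(tuple_key),
    'list': is_hashable(list_key),
    'ndarray': is_hashable(array_key),
}
print(results)
```

Only the tuple form is hashable, which matches the two `TypeError`s in the question.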

To access multi-indexed data using loc, you need to convert your numpy array to a list of tuples, because tuples are immutable. One way to do so is with map, as you mentioned.
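If the conversion is needed often, it can be wrapped once. The sketch below rebuilds a trimmed version of the question's dataframe (only the `edge_id` column); `as_tuples` is a hypothetical helper name, not a pandas function.

```python
import itertools as it

import numpy as np
import pandas as pd

# Rebuild a trimmed version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)


def as_tuples(labels):
    """Hypothetical helper: turn an (n, k) array or list-of-lists into a list of tuples."""
    return list(map(tuple, np.asarray(labels).tolist()))


edge_subset2 = np.array([(1, 2), (2, 3), (1, 4)])

# .loc accepts the list of tuples and keeps the requested order
subset = df.loc[as_tuples(edge_subset2)]
print(subset)
```

Going through `tolist()` first also turns numpy integers into plain Python ints, so the tuples look the same as the ones used to build the index.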

If you want to avoid using map and you're reading the edges from a CSV file, you could read them into a DataFrame and then call to_records with index=False. Another way is to create a MultiIndex from the ndarray. You have to transpose the array before passing it, so that each row of the transposed array holds the labels for one level.
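A minimal sketch of the CSV route, under a few assumptions: the CSV text is an invented in-memory stand-in for a real file, its columns are assumed to be named `u` and `v`, and the extra `.tolist()` call is my addition to turn the record array into a plain list of tuples before passing it to `loc`.

```python
import io
import itertools as it

import pandas as pd

# Rebuild a trimmed version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)

# Assumed stand-in for an edge list stored in a CSV file
csv_text = "u,v\n1,2\n2,3\n1,4\n"
edge_frame = pd.read_csv(io.StringIO(csv_text))

# to_records(index=False) gives one record per row; tolist() makes them tuples
edge_labels = edge_frame.to_records(index=False).tolist()

subset = df.loc[edge_labels]
print(subset)
```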

import pandas as pd

# Transpose so that each row of the array holds the labels for one index level
df1 = df.loc[pd.MultiIndex.from_arrays(edge_subset2.T)]
print(df1)

# Output:
          edge_id    edge_weight
------  ---------  -------------
(1, 2)          0       0.548814
(2, 3)          3       0.544883
(1, 4)          2       0.602763
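For a version that runs on its own, the sketch below rebuilds a trimmed copy of the question's dataframe and performs the same `MultiIndex.from_arrays` lookup. Passing `names=df.index.names` and wrapping the transposed array in `list(...)` are my own choices here, not something the lookup above relies on.

```python
import itertools as it

import numpy as np
import pandas as pd

# Rebuild a trimmed version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)

edge_subset2 = np.array([(1, 2), (2, 3), (1, 4)])

# Each row of the transposed array becomes one level of the lookup index
labels = pd.MultiIndex.from_arrays(list(edge_subset2.T), names=df.index.names)

subset = df.loc[labels]
print(subset)
```

This avoids building Python tuples by hand, at the cost of constructing a temporary `MultiIndex` for every lookup.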

I found the article on advanced multi-indexing in the pandas documentation very helpful.

sgDysregulation answered Oct 26 '22 13:10