Pandas MultiIndex lookup with Numpy arrays

Tags: python, pandas, numpy

I'm working with a pandas DataFrame that represents a graph. The dataframe is indexed by a MultiIndex that indicates the node endpoints.

Setup:

import pandas as pd
import numpy as np
import itertools as it
edges = list(it.combinations([1, 2, 3, 4], 2))

# Define a dataframe to represent a graph
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame.from_dict({
    'edge_id': list(range(len(edges))),
    'edge_weight': np.random.RandomState(0).rand(len(edges)),
})
df.index = index
print(df)
## -- End pasted text --
     edge_id  edge_weight
u v                      
1 2        0       0.5488
  3        1       0.7152
  4        2       0.6028
2 3        3       0.5449
  4        4       0.4237
3 4        5       0.6459

I want to be able to index into the graph using an edge subset, which is why I've chosen to use a MultiIndex. I'm able to do this just fine as long as the input to df.loc is a list of tuples.

# Select subset of graph using list-of-tuple indexing
edge_subset1 = [edges[x] for x in [0, 3, 2]]
df.loc[edge_subset1]
## -- End pasted text --
     edge_id  edge_weight
u v                      
1 2        0       0.5488
2 3        3       0.5449
1 4        2       0.6028

However, when my list of edges is a numpy array (as it often is), or a list of lists, then I seem to be unable to use the df.loc property.

# Why can't I do this if `edge_subset2` is a numpy array?
edge_subset2 = np.array(edge_subset1)
df.loc[edge_subset2]
## -- End pasted text --
TypeError: unhashable type: 'numpy.ndarray'

It would be OK if I could just call arr.tolist(), but this results in a seemingly different error.

# Why can't I do this if `edge_subset2` is a numpy array?
# or if `edge_subset3` is a list-of-lists?
edge_subset3 = edge_subset2.tolist()
df.loc[edge_subset3]
## -- End pasted text --
TypeError: '[1, 2]' is an invalid key

It's a real pain to have to use list(map(tuple, arr.tolist())) every time I want to select a subset. It would be nice if there was another way to do this.

The main questions are:

Why can't I use a numpy array with .loc? Is it because under the hood a dictionary is being used to map the multi-index labels to positional indices?
Why does a list-of-lists give a different error? Maybe it's really the same problem, just caught a different way?
Is there another (ideally less verbose) way to look up a subset of a dataframe with a numpy array of multi-index labels that I'm unaware of?

asked Jan 05 '17 19:01

Erotemic

1 Answer

Dictionary keys must be hashable, and lists and numpy arrays are not; that's basically why you can't use a list of lists to access a MultiIndex.
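As a small illustration of the hashability point, the sketch below checks which kinds of edge label Python can hash. The helper name `is_hashable` is made up for this example and is not part of pandas or numpy.

```python
import numpy as np


def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


label_tuple = (1, 2)             # immutable, hashable
label_list = [1, 2]              # mutable, not hashable
label_array = np.array([1, 2])   # mutable, not hashable

print(is_hashable(label_tuple))  # True
print(is_hashable(label_list))   # False
print(is_hashable(label_array))  # False
```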

To access multi-indexed data using loc, you need to convert your numpy array to a list of tuples, since tuples are immutable. One way to do so is using map, as you mentioned.
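A minimal sketch of that conversion, rebuilding a reduced version of the question's dataframe so it runs on its own (only the `edge_id` column is kept, and `edge_array` is a name chosen here for illustration):

```python
import itertools as it

import numpy as np
import pandas as pd

# Rebuild a reduced version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)

# Edge labels as a 2D numpy array, one (u, v) pair per row
edge_array = np.array([(1, 2), (2, 3), (1, 4)])

# Turn each row into a hashable tuple of Python ints
edge_tuples = list(map(tuple, edge_array.tolist()))

subset = df.loc[edge_tuples]
print(subset['edge_id'].tolist())  # [0, 3, 2]
```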

If you want to avoid using map and you're reading the edges from a CSV file, you could read them into a DataFrame and then use to_records with the index argument set to False. Another way is to create a MultiIndex from the ndarray, but you have to transpose the array before passing it so that each level is one row of the array:

import pandas as pd

# Transpose so that each row of the array holds the labels of one index level
df1 = df.loc[pd.MultiIndex.from_arrays(edge_subset2.T)]
print(df1)

# outputs
          edge_id    edge_weight
------  ---------  -------------
(1, 2)          0       0.548814
(2, 3)          3       0.544883
(1, 4)          2       0.602763
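If this lookup is needed often, the same idea could be wrapped in a small function. The sketch below is only one possible shape for it: `select_edges` is a hypothetical name, and it assumes the dataframe has a two-level MultiIndex and that the edges arrive as an (n, 2) array or list of lists.

```python
import itertools as it

import numpy as np
import pandas as pd


def select_edges(df, edge_array):
    """Hypothetical helper: select rows of `df` by an (n, 2) array of (u, v) labels."""
    edge_array = np.asarray(edge_array)
    # from_arrays expects one array per index level, hence the transpose
    key = pd.MultiIndex.from_arrays(list(edge_array.T), names=df.index.names)
    return df.loc[key]


# Reduced version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)

# Works for a numpy array; np.asarray also lets a list of lists through
subset = select_edges(df, np.array([[1, 2], [2, 3], [1, 4]]))
print(subset['edge_id'].tolist())
```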

I found the "MultiIndex / advanced indexing" article in the pandas documentation very helpful.
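For the to_records route mentioned earlier in this answer, a rough sketch might look like the following. The CSV content is invented for the example and read from an in-memory string, and the final `.tolist()` call is an added assumption: it turns the record array into a plain list of tuples before it is passed to loc.

```python
import io
import itertools as it

import pandas as pd

# Reduced version of the graph dataframe from the question
edges = list(it.combinations([1, 2, 3, 4], 2))
index = pd.MultiIndex.from_tuples(edges, names=['u', 'v'])
df = pd.DataFrame({'edge_id': list(range(len(edges)))}, index=index)

# Made-up CSV holding one edge per line
csv_text = "u,v\n1,2\n2,3\n1,4\n"
edge_df = pd.read_csv(io.StringIO(csv_text))

# Record array without the row index, then a list of (u, v) tuples
records = edge_df.to_records(index=False)
edge_tuples = records.tolist()

subset = df.loc[edge_tuples]
print(subset['edge_id'].tolist())
```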

answered Oct 26 '22 13:10

sgDysregulation
