
AssertionError: Gaps in blk ref_locs when unstack() dataframe

I am trying to unstack() data in a Pandas dataframe, but I keep getting this error and I'm not sure why. Here is my code so far, with a sample of my data. I tried to fix it by removing all rows where voteId was not a number, but that did not work with my actual dataset. The error occurs both in an Anaconda notebook (where I am developing) and in my production environment when I deploy the code.

I could not figure out how to reproduce the error in my sample code, possibly because of a typecasting issue that doesn't exist when you instantiate the dataframe the way I did in the sample.

#dataset simulate likely input
# d = {'vote': [100, 50,1,23,55,67,89,44], 
#      'vote2': [10, 2,18,26,77,99,9,40], 
#      'ballot1': ['a','b','a','a','b','a','c','c'],
#      'voteId':[1,2,3,4,5,'aaa',7,'NaN']}
# df1=pd.DataFrame(d)
#########################################################

df1=df1.drop_duplicates(['voteId','ballot1'],keep='last')

s=df1[:10].set_index(['voteId','ballot1'],verify_integrity=True).unstack()
s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format) 
dflw=pd.DataFrame(s)

Full error message/stack trace:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-0a520180a8d9> in <module>()
     22 df1=df1.drop_duplicates(['voteId','ballot1'],keep='last')
     23 
---> 24 s=df1[:10].set_index(['voteId','ballot1'],verify_integrity=True).unstack()
     25 s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
     26 dflw=pd.DataFrame(s)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in unstack(self, level, fill_value)
   4567         """
   4568         from pandas.core.reshape.reshape import unstack
-> 4569         return unstack(self, level, fill_value)
   4570 
   4571     _shared_docs['melt'] = ("""

~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    467     if isinstance(obj, DataFrame):
    468         if isinstance(obj.index, MultiIndex):
--> 469             return _unstack_frame(obj, level, fill_value=fill_value)
    470         else:
    471             return obj.T.stack(dropna=False)

~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    480         unstacker = partial(_Unstacker, index=obj.index,
    481                             level=level, fill_value=fill_value)
--> 482         blocks = obj._data.unstack(unstacker)
    483         klass = type(obj)
    484         return klass(blocks)

~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in unstack(self, unstacker_func)
   4349         new_columns = new_columns[columns_mask]
   4350 
-> 4351         bm = BlockManager(new_blocks, [new_columns, new_index])
   4352         return bm
   4353 

~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   3035         self._consolidate_check()
   3036 
-> 3037         self._rebuild_blknos_and_blklocs()
   3038 
   3039     def make_empty(self, axes=None):

~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
   3127 
   3128         if (new_blknos == -1).any():
-> 3129             raise AssertionError("Gaps in blk ref_locs")
   3130 
   3131         self._blknos = new_blknos

AssertionError: Gaps in blk ref_locs
Rilcon42 asked Mar 26 '18 00:03


2 Answers

To find the actual data that triggers the exception, add extra debugging output to pandas.

Modify ~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py

Add these lines to BlockManager.__init__, just before the call to self._rebuild_blknos_and_blklocs():

    from pprint import pprint
    print('BlockManager blocks')
    pprint(self.blocks)
    print('BlockManager axes')
    pprint(self.axes)

You will see output like this:

BlockManager blocks
(FloatBlock: slice(0, 16, 1), 16 x 8, dtype: float64,)
BlockManager axes
[MultiIndex(levels=[[u'vote', u'vote2'], [False, 8, u'\n', u' ', u'\\', u'aaa', u'xx']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [-1, 0, 1, 2, 3, 4, 5, 6, -1, 0, 1, 2, 3, 4, 5, 6]],
           names=[None, u'voteId']),
 Index([nan, -1, False, True, u'', u'a', u'b', u'c'], dtype='object', name=u'ballot1')]

Modify ~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py

Add these lines at the top of _unstack_frame(obj, level, fill_value):

    from pprint import pprint
    print('_unstack_frame level {} fill_value {} {}'.format(level, fill_value, type(obj)))
    pprint(obj)

You will see the frame that is about to be unstacked:

_unstack_frame level -1 fill_value None

                 vote  vote2
ballot1 voteId              
NaN     xx      100.0   10.0
False   aaa      50.1    2.0
-1      \n        1.0   18.0
True    NaN      23.0   26.0
b       False    55.0   77.0
a       \        67.0   99.0
c                89.0    9.0
        8        44.0    NaN

I did trigger an exception with another example:

  File "/usr/lib64/python2.7/site-packages/pandas/core/internals.py", line 2902, in _rebuild_blknos_and_blklocs
    raise AssertionError("Gaps in blk ref_locs")
AssertionError: Gaps in blk ref_locs


The debugging output for that case was:

BlockManager blocks
(FloatBlock: [-1, -1, -1], 3 x 2, dtype: float64,)
BlockManager axes
[Index([aaa, bbb, ccc], dtype='object'), Int64Index([0, 1], dtype='int64')]
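
If patching the installed pandas is not practical, a less invasive first check is to look at the values that will end up in the index. This is only a minimal sketch: it assumes the key columns are named voteId and ballot1 as in the question, the helper name summarize_keys is made up for this example, and the sample values are hypothetical.

```python
import pandas as pd


def summarize_keys(df, keys):
    """Per key column, report the Python types present and the number of missing values.

    summarize_keys is a hypothetical helper, not part of pandas.
    """
    report = {}
    for key in keys:
        col = df[key]
        # Count the distinct Python types among the non-missing values.
        type_counts = (col.dropna()
                          .map(lambda v: type(v).__name__)
                          .value_counts()
                          .to_dict())
        report[key] = {'types': type_counts, 'missing': int(col.isna().sum())}
    return report


# Hypothetical sample modelled on the question's columns.
df = pd.DataFrame({
    'vote': [100, 50, 1, 23],
    'ballot1': ['a', 'b', None, 'a'],
    'voteId': [1, 2, 'aaa', None],
})
report = summarize_keys(df, ['voteId', 'ballot1'])
print(report)
```

A key column that reports more than one type, or any missing values, is a good candidate for the rows that end up as odd index labels like the ones in the debug output above.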

Gang answered Nov 18 '22 06:11


I did some testing with your example code.

Observation 1:

Here is one possible minimal, verifiable example of the issue:

import pandas as pd
from IPython.display import display

#dataset simulate likely input
d = {'vote': [100, 50,1,23,55,67,89,44], 
     'vote2': [10, 'a',18,55,77,99,9,40], 
     'ballot1': [1,None,3,4,5,6,7,8],
     'voteId':[1,2,3,4,5,6,7,8]}
df1 = pd.DataFrame(d)
#########################################################

df1 = df1.drop_duplicates(['voteId','ballot1'],keep='last')

s = df1[:10].reset_index().set_index(['voteId','ballot1'],verify_integrity=True).unstack()
s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
dflw=pd.DataFrame(s)
display(dflw)

Assuming the data can be anything, I modified it a little and found the following (based on this example):

1) For some reason, the two index columns have to be very similar to each other,
   differing only by a None in one of them.
2) vote and vote2 need to have one number in common.
3) One of the vote columns needs to include an anomaly (a letter or None).
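
One way to spot such an anomaly is to look at the column dtypes: a numeric column that contains a letter is stored as object. The following is only a small check on the data from the example above, not a fix for the assertion; the names as_number and bad_rows are introduced here for illustration.

```python
import pandas as pd

d = {'vote': [100, 50, 1, 23, 55, 67, 89, 44],
     'vote2': [10, 'a', 18, 55, 77, 99, 9, 40],
     'ballot1': [1, None, 3, 4, 5, 6, 7, 8],
     'voteId': [1, 2, 3, 4, 5, 6, 7, 8]}
df1 = pd.DataFrame(d)

# A letter among the numbers makes the whole column object dtype,
# and a None among integers turns the column into float64.
print(df1.dtypes)

# Coerce to numeric to locate the offending rows: values that cannot be
# parsed become NaN, so rows that were present but are now NaN are the anomalies.
as_number = pd.to_numeric(df1['vote2'], errors='coerce')
bad_rows = df1[as_number.isna() & df1['vote2'].notna()]
print(bad_rows)
```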

Observation 2:

I also made another dataset (maybe closer to yours):

d = {'vote': [10, None,2,23,55,67,89,44],
     'vote2': [10,2,3,55,77,99,9,40],
     'ballot1': [1,None,3,4,5,6,7,8],
     'voteId':['a','b','a','a','c','a','c','a']}
df1 = pd.DataFrame(d)

Interestingly, ballot1 and voteId are in a different order here than in your case; when they are in the same order as yours, it works just fine.

My observation here is that ballot1 is the index level that needs the gap for the failure to occur. In addition, one vote value must be None, and the two vote series must share a value.

Discussion:

If both ballot1 and voteId (the index columns) can contain only integers, but ballot1 also contains some anomalies, this error may be raised, depending on the data in the vote columns.

The assertion is raised when the index column values have a gap, and it is possibly related to the df1[:10] slice you use, as Zev commented about the issue on GitHub.

In my example cases, though, the workaround suggested on GitHub had no effect. It is better to remove the None values from data that is otherwise in good shape.
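
A sketch of that cleanup, assuming the index columns are voteId and ballot1 as in the question. The sample values are made up, and whether this avoids the assertion on your real data will depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical sample with one missing ballot1 value.
df1 = pd.DataFrame({
    'vote': [10, 20, 30, 40],
    'vote2': [1, 2, 3, 4],
    'ballot1': ['a', None, 'b', 'a'],
    'voteId': [1, 2, 2, 3],
})

keys = ['voteId', 'ballot1']
# Drop rows whose index keys are missing, then remove duplicate key pairs.
clean = df1.dropna(subset=keys).drop_duplicates(keys, keep='last')

wide = clean.set_index(keys, verify_integrity=True).unstack()
# Flatten the (value column, ballot1) column labels the same way the question does.
wide.columns = wide.columns.map('(ballot1={0[1]}){0[0]}'.format)
print(wide)
```

Dropping the rows before set_index means no NaN label ever reaches the index, which is the situation the observations above point to.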

Sidenotes:

I don't know whether the ballot1 data is allowed to contain integers, but if it is, scenarios like these exist in which the error can occur. Whether they are helpful depends on your case, which you did not fully understand at the time you wrote your question. At least you now have some pointers to try.

mico answered Nov 18 '22 07:11