I am trying to unstack() data in a Pandas dataframe, but I keep getting this error, and I'm not sure why. Here is my code so far with a sample of my data. My attempt to fix it was to remove all rows where voteId was not a number, which did not work with my actual dataset. This is happening both in an Anaconda notebook (where I am developing) and in my production env when I deploy the code.
I could not figure out how to reproduce the error in my sample code, possibly because of a typecasting issue that doesn't exist when you instantiate the dataframe the way I did in the sample.
#dataset simulate likely input
# d = {'vote': [100, 50,1,23,55,67,89,44],
# 'vote2': [10, 2,18,26,77,99,9,40],
# 'ballot1': ['a','b','a','a','b','a','c','c'],
# 'voteId':[1,2,3,4,5,'aaa',7,'NaN']}
# df1=pd.DataFrame(d)
#########################################################
df1=df1.drop_duplicates(['voteId','ballot1'],keep='last')
s=df1[:10].set_index(['voteId','ballot1'],verify_integrity=True).unstack()
s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
dflw=pd.DataFrame(s)
Full error message/stack trace:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-10-0a520180a8d9> in <module>()
22 df1=df1.drop_duplicates(['voteId','ballot1'],keep='last')
23
---> 24 s=df1[:10].set_index(['voteId','ballot1'],verify_integrity=True).unstack()
25 s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
26 dflw=pd.DataFrame(s)
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in unstack(self, level, fill_value)
4567 """
4568 from pandas.core.reshape.reshape import unstack
-> 4569 return unstack(self, level, fill_value)
4570
4571 _shared_docs['melt'] = ("""
~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
467 if isinstance(obj, DataFrame):
468 if isinstance(obj.index, MultiIndex):
--> 469 return _unstack_frame(obj, level, fill_value=fill_value)
470 else:
471 return obj.T.stack(dropna=False)
~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
480 unstacker = partial(_Unstacker, index=obj.index,
481 level=level, fill_value=fill_value)
--> 482 blocks = obj._data.unstack(unstacker)
483 klass = type(obj)
484 return klass(blocks)
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in unstack(self, unstacker_func)
4349 new_columns = new_columns[columns_mask]
4350
-> 4351 bm = BlockManager(new_blocks, [new_columns, new_index])
4352 return bm
4353
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
3035 self._consolidate_check()
3036
-> 3037 self._rebuild_blknos_and_blklocs()
3038
3039 def make_empty(self, axes=None):
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
3127
3128 if (new_blknos == -1).any():
-> 3129 raise AssertionError("Gaps in blk ref_locs")
3130
3131 self._blknos = new_blknos
AssertionError: Gaps in blk ref_locs
To capture the actual data that triggers the exception, add some extra debug output to pandas itself.
Modify
~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py
and add these lines to BlockManager.__init__(), after self.blocks and self.axes have been assigned and before the call to self._rebuild_blknos_and_blklocs():
from pprint import pprint
print('BlockManager blocks')
pprint(self.blocks)
print('BlockManager axes')
pprint(self.axes)
You will see the data:
BlockManager blocks
(FloatBlock: slice(0, 16, 1), 16 x 8, dtype: float64,)
BlockManager axes
[MultiIndex(levels=[[u'vote', u'vote2'], [False, 8, u'\n', u' ', u'\', u'aaa', u'xx']],
            labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [-1, 0, 1, 2, 3, 4, 5, 6, -1, 0, 1, 2, 3, 4, 5, 6]],
            names=[None, u'voteId']),
 Index([nan, -1, False, True, u'', u'a', u'b', u'c'], dtype='object', name=u'ballot1')]
Modify
~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py
and add these lines at the top of _unstack_frame(obj, level, fill_value):
from pprint import pprint
print('_unstack_frame level {} fill_value {} {}'.format(level, fill_value, type(obj)))
pprint(obj)
You will see the data:
_unstack_frame level -1 fill_value None
                 vote  vote2
ballot1 voteId
NaN     xx      100.0   10.0
False   aaa      50.1    2.0
-1      \n        1.0   18.0
True    NaN      23.0   26.0
b       False    55.0   77.0
a       \        67.0   99.0
c                89.0    9.0
        8        44.0    NaN
I did trigger the exception with another example:
File "/usr/lib64/python2.7/site-packages/pandas/core/internals.py", line 2902, in _rebuild_blknos_and_blklocs
    raise AssertionError("Gaps in blk ref_locs")
AssertionError: Gaps in blk ref_locs
with this debugging info:
BlockManager blocks
(FloatBlock: [-1, -1, -1], 3 x 2, dtype: float64,)
BlockManager axes
[Index([aaa, bbb, ccc], dtype='object'), Int64Index([0, 1], dtype='int64')]
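The same kind of information can be gathered without editing the installed pandas sources. The following is only a sketch: key_type_summary is a hypothetical helper, not a pandas function, and the sample frame is the commented-out dataset from the question. It counts the Python types found in each column that will become an index level, which makes mixed keys (integers next to strings, booleans, or missing values) easy to spot.

```python
import pandas as pd


def key_type_summary(df, key_cols):
    """Count the Python types found in each would-be index column.

    Hypothetical helper for illustration; not part of pandas.
    """
    return {
        col: df[col].map(lambda v: type(v).__name__).value_counts().to_dict()
        for col in key_cols
    }


# Sample data modelled on the question's commented-out dataset.
df1 = pd.DataFrame({
    'vote': [100, 50, 1, 23, 55, 67, 89, 44],
    'vote2': [10, 2, 18, 26, 77, 99, 9, 40],
    'ballot1': ['a', 'b', 'a', 'a', 'b', 'a', 'c', 'c'],
    'voteId': [1, 2, 3, 4, 5, 'aaa', 7, 'NaN'],
})

summary = key_type_summary(df1, ['voteId', 'ballot1'])
print(summary)
```

A column that reports more than one type is a candidate for cleaning before set_index() and unstack() are called.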
I did some testing with your example code.
Observation 1:
Here is one possible minimal, verifiable example of the issue:
import pandas as pd
from IPython.display import display
#dataset simulate likely input
d = {'vote': [100, 50, 1, 23, 55, 67, 89, 44],
     'vote2': [10, 'a', 18, 55, 77, 99, 9, 40],
     'ballot1': [1, None, 3, 4, 5, 6, 7, 8],
     'voteId': [1, 2, 3, 4, 5, 6, 7, 8]}
df1 = pd.DataFrame(d)
#########################################################
df1 = df1.drop_duplicates(['voteId','ballot1'],keep='last')
s = df1[:10].reset_index().set_index(['voteId','ballot1'],verify_integrity=True).unstack()
s.columns=s.columns.map('(ballot1={0[1]}){0[0]}'.format)
dflw=pd.DataFrame(s)
display(dflw)
Assuming the data can be anything, I modified it slightly and found the following (based on this example):
1) For some reason, the two index columns have to be nearly identical,
but differ by a single None in one of them.
2) vote and vote2 need to have at least one number in common.
3) One of the vote columns needs to contain an anomaly (a letter or None).
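The three conditions above can be checked on a frame before unstacking. This is a rough sketch using the example data from this answer; the variable names and the non_numeric helper are made up for illustration.

```python
import pandas as pd

d = {'vote': [100, 50, 1, 23, 55, 67, 89, 44],
     'vote2': [10, 'a', 18, 55, 77, 99, 9, 40],
     'ballot1': [1, None, 3, 4, 5, 6, 7, 8],
     'voteId': [1, 2, 3, 4, 5, 6, 7, 8]}
df1 = pd.DataFrame(d)

# 1) Rows where the two index columns disagree (here: only the None row,
#    because a missing value never compares equal to anything).
index_mismatch = df1[df1['voteId'] != df1['ballot1']]

# 2) Values that appear in both vote columns.
shared_votes = set(df1['vote'].tolist()) & set(df1['vote2'].tolist())


# 3) Entries in a vote column that cannot be read as numbers.
def non_numeric(series):
    """Return the entries that pd.to_numeric cannot convert (hypothetical helper)."""
    return series[pd.to_numeric(series, errors='coerce').isna()]


anomalies = {col: non_numeric(df1[col]).tolist() for col in ['vote', 'vote2']}

print(index_mismatch)
print(shared_votes)
print(anomalies)
```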
Observation 2:
I also made another dataset (perhaps closer to yours):
d = {'vote': [10, None, 2, 23, 55, 67, 89, 44],
     'vote2': [10, 2, 3, 55, 77, 99, 9, 40],
     'ballot1': [1, None, 3, 4, 5, 6, 7, 8],
     'voteId': ['a', 'b', 'a', 'a', 'c', 'a', 'c', 'a']}
df1 = pd.DataFrame(d)
Interestingly, ballot1 and voteId are in a different order here than in your case; when they are in the same order as yours, it works just fine.
My observation here is that ballot1 is the index that really needs the gap (the None) for the failure to occur. In addition, one vote value has to be None, and the two vote series need to share a value.
Discussion:
If both ballot1 and voteId (the index columns) can consist solely of integers, but ballot1 also contains some anomalies, this error may be raised, depending on the data in the vote columns.
The assertion is raised when the index column values have a gap, and it is possibly related to the df1[:10] slice in your code, as Zev commented on the GitHub issue.
In my example cases, though, the workaround suggested on GitHub had no effect. It is better to remove the None values from data that is otherwise already in good shape.
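One way to drop the None values before unstacking is sketched below, assuming the index columns are voteId and ballot1 as in the Observation 1 example. It has not been confirmed against the real dataset from the question, so treat it as a starting point rather than a guaranteed fix.

```python
import pandas as pd

d = {'vote': [100, 50, 1, 23, 55, 67, 89, 44],
     'vote2': [10, 'a', 18, 55, 77, 99, 9, 40],
     'ballot1': [1, None, 3, 4, 5, 6, 7, 8],
     'voteId': [1, 2, 3, 4, 5, 6, 7, 8]}
df1 = pd.DataFrame(d)

keys = ['voteId', 'ballot1']

# Drop rows whose index keys are missing, then de-duplicate as in the question.
clean = df1.dropna(subset=keys).drop_duplicates(keys, keep='last')

# Same reshaping steps as in the question, applied to the cleaned frame.
s = clean.set_index(keys, verify_integrity=True).unstack()
s.columns = s.columns.map('(ballot1={0[1]}){0[0]}'.format)
print(s.shape)
```

Here the row with ballot1 set to None is removed, so every remaining (voteId, ballot1) pair is a complete key when set_index() builds the MultiIndex.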
Side notes:
I don't know whether the ballot1 data is allowed to contain integers, but if it is, scenarios like these exist in which the error can occur. Whether they are helpful depends on your case, which was not yet clear to you when you wrote the question. At least you now have some pointers to try.