
Pandas long to wide format with multi-index

I have a dataframe that looks like this:

data.head()
Out[2]: 
        Area Area Id                  Variable Name Variable Id  Year  \
0  Argentina       9  Conservation agriculture area        4454  1982   
1  Argentina       9  Conservation agriculture area        4454  1987   
2  Argentina       9  Conservation agriculture area        4454  1992   
3  Argentina       9  Conservation agriculture area        4454  1997   
4  Argentina       9  Conservation agriculture area        4454  2002   
     Value Symbol Md  
0      2.0            
1      6.0            
2    500.0       

I would like to pivot it so that Variable Name becomes the columns, Area and Year form the index, and Value supplies the values. The most intuitive way to me is:

data.pivot(index=['Area', 'Year'], columns='Variable Name', values='Value')

However I get the error:

Traceback (most recent call last):
  File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-4c786386b703>", line 1, in <module>
    pd.concat(data_list).pivot(index=['Area', 'Year'], columns='Variable Name', values='Value')
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\frame.py", line 3853, in pivot
    return pivot(self, index=index, columns=columns, values=values)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 377, in pivot
    index=MultiIndex.from_arrays([index, self[columns]]))
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\series.py", line 250, in __init__
    data = SingleBlockManager(data, index, fastpath=True)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 4117, in __init__
    fastpath=True)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 2719, in make_block
    return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 1844, in __init__
    placement=placement, **kwargs)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 115, in __init__
    len(self.mgr_locs)))
ValueError: Wrong number of items passed 119611, placement implies 2

How should I interpret this? I've also tried another way:

data.set_index(['Area', 'Variable Name', 'Year']).loc[:, 'Value'].unstack('Variable Name')

to try to get the same result, but I get this error:

Traceback (most recent call last):
  File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-222325ea01e1>", line 1, in <module>
    pd.concat(data_list).set_index(['Area', 'Variable Name', 'Year']).loc[:, 'Value'].unstack('Variable Name')
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\series.py", line 2028, in unstack
    return unstack(self, level, fill_value)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 458, in unstack
    fill_value=fill_value)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 110, in __init__
    self._make_selectors()
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 148, in _make_selectors
    raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape

Is there something wrong with the data? I've confirmed that no combination of Area, Variable Name, and Year is repeated in the dataframe, so I don't think there should be any duplicate entries, but I could be wrong. How can I convert from long to wide format, given that neither of these methods currently works? I've checked answers here and here, but both are cases where some type of aggregation is involved.
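
One way to double-check the "no duplicates" claim is to ask pandas for every row whose key columns are repeated. This is a minimal sketch on made-up data (the small frame below is hypothetical, not the real dataset), using DataFrame.duplicated with keep=False so that all members of a repeated key are shown:

```python
import pandas as pd

# Hypothetical long-format data with one repeated (Area, Variable Name, Year) key.
data = pd.DataFrame({
    "Area": ["Argentina", "Argentina", "Argentina"],
    "Variable Name": ["Conservation agriculture area"] * 3,
    "Year": [1982, 1987, 1987],
    "Value": [2.0, 6.0, 6.5],
})

keys = ["Area", "Variable Name", "Year"]

# keep=False flags every row that shares its key with at least one other row.
dupes = data[data.duplicated(subset=keys, keep=False)]
print(dupes)
```

If dupes comes back empty on the real data, the keys are unique and the reshape error would have to come from somewhere else.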

I've tried using pivot_table like so:

data.pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')

but I think some type of aggregation is being done, and there are a lot of missing values in the dataset, which leads to this error:

Traceback (most recent call last):
  File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-77b28d2f0dbb>", line 1, in <module>
    pd.concat(data_list).pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\pivot.py", line 136, in pivot_table
    agged = grouped.agg(aggfunc)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 4036, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3468, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\base.py", line 435, in _aggregate
    **kwargs), None
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\base.py", line 391, in _try_aggregate_string_function
    return f(*args, **kwargs)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 1037, in mean
    return self._cython_agg_general('mean', **kwargs)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3354, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only)
  File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3425, in _cython_agg_blocks
    raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate

pbreach asked Dec 24 '22 12:12


1 Answer

I think you first need to convert the Value column to numeric and then use pivot_table, whose default aggregation function is mean:

# If every value is a float that was saved as a string:
data['Value'] = data['Value'].astype(float)
# If some values cannot be parsed and the line above raises ValueError,
# coerce the unparseable entries to NaN instead:
data['Value'] = pd.to_numeric(data['Value'], errors='coerce')

data.pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')
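
As a rough illustration of the conversion step, here is a self-contained sketch on invented data. The frame, its values, and the "n/a" placeholder are assumptions for the example only; the real dataset may store bad values differently:

```python
import pandas as pd

# Hypothetical long-format data: Value is stored as strings, with one unparseable entry.
data = pd.DataFrame({
    "Area": ["Argentina", "Argentina", "Argentina", "Argentina"],
    "Year": [1982, 1982, 1987, 1987],
    "Variable Name": [
        "Conservation agriculture area", "Total population",
        "Conservation agriculture area", "Total population",
    ],
    "Value": ["2.0", "28000", "6.0", "n/a"],
})

# Coerce to a numeric dtype; strings that cannot be parsed become NaN.
data["Value"] = pd.to_numeric(data["Value"], errors="coerce")

# With a numeric Value column, the default mean aggregation has something to work on.
wide = data.pivot_table(index=["Area", "Year"], columns="Variable Name", values="Value")
print(wide)
```

Entries that were coerced to NaN simply show up as missing cells in the wide table.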

Or:

data.groupby(['Area', 'Variable Name', 'Year'])['Value'].mean().unstack('Variable Name')
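
To show how the two forms relate, here is a small sketch on made-up numeric data (the frame is hypothetical). It includes one repeated key on purpose, so the effect of the mean is visible: both approaches collapse the repeated rows into their average rather than raising an error.

```python
import pandas as pd

# Hypothetical numeric data; the (Argentina, 1987) key appears twice.
data = pd.DataFrame({
    "Area": ["Argentina", "Argentina", "Argentina"],
    "Variable Name": ["Conservation agriculture area"] * 3,
    "Year": [1982, 1987, 1987],
    "Value": [2.0, 6.0, 7.0],
})

by_pivot = data.pivot_table(index=["Area", "Year"], columns="Variable Name", values="Value")

by_group = (
    data.groupby(["Area", "Variable Name", "Year"])["Value"]
    .mean()
    .unstack("Variable Name")
)

print(by_pivot)
print(by_group)
```

Because the mean silently merges repeated keys, it may be worth checking for duplicates first if averaging them is not what you want.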

jezrael answered Dec 27 '22 01:12