
Pandas pivot table ValueError: Index contains duplicate entries, cannot reshape

Tags: python, pandas

I have a dataframe as shown below (top 3 rows):

Sample_Name Sample_ID   Sample_Type IS  Component_Name  IS_Name Component_Group_Name    Outlier_Reasons Actual_Concentration    Area    Height  Retention_Time  Width_at_50_pct Used    Calculated_Concentration    Accuracy
Index                                                               
1   20170824_ELN147926_HexLacCer_Plasma_A-1-1   NaN Unknown True    GluCer(d18:1/12:0)_LCB_264.3    NaN NaN NaN 0.1 2.733532e+06    5.963840e+05    2.963911    0.068676    True    NaN NaN
2   20170824_ELN147926_HexLacCer_Plasma_A-1-1   NaN Unknown True    GluCer(d18:1/17:0)_LCB_264.3    NaN NaN NaN 0.1 2.945190e+06    5.597470e+05    2.745026    0.068086    True    NaN NaN
3   20170824_ELN147926_HexLacCer_Plasma_A-1-1   NaN Unknown False   GluCer(d18:1/16:0)_LCB_264.3    GluCer(d18:1/17:0)_LCB_264.3    NaN NaN NaN 3.993535e+06    8.912731e+05    2.791991    0.059864    True    125.927659773487    NaN

When trying to generate a pivot table:

pivoted_report_conc = raw_report.pivot(index = "Sample_Name", columns = 'Component_Name', values = "Calculated_Concentration")

I get the following error:

ValueError: Index contains duplicate entries, cannot reshape

I tried resetting the index but it did not help. I couldn't find any duplicate values in the "Index" column. Could someone please help identify the problem here?

The expected output would be a reshaped dataframe with only the unique component names as columns and respective concentrations for each sample name:

Sample_Name    GluCer(d18:1/12:0)_LCB_264.3    GluCer(d18:1/17:0)_LCB_264.3    GluCer(d18:1/16:0)_LCB_264.3
20170824_ELN147926_HexLacCer_Plasma_A-1-1    NaN    NaN    125.927659773487

To clarify, I am not looking to aggregate the data, just reshape it.

asked Aug 31 '17 20:08 by kkhatri99




2 Answers

You can use groupby() and unstack() to get around the error you're seeing with pivot().

Here's some example data, with a few edge cases added, and some column values removed or substituted for MCVE:

# df
      Sample_Name  Sample_ID     IS Component_Name Calculated_Concentration Outlier_Reasons
Index                                                                    
1             foo        NaN   True              x                  NaN              NaN  
1             foo        NaN   True              y                  NaN              NaN 
2             foo        NaN   False             z            125.92766              NaN 
2             bar        NaN   False             x                 1.00              NaN  
2             bar        NaN   False             y                 2.00              NaN  
2             bar        NaN   False             z                  NaN              NaN  

(df.groupby(['Sample_Name','Component_Name'])
   .Calculated_Concentration
   .first()
   .unstack()
)

Output:

Component_Name    x   y          z
Sample_Name                       
bar             1.0 2.0        NaN
foo             NaN NaN  125.92766
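To see why pivot() fails where the groupby() route does not, it helps to look for repeated (Sample_Name, Component_Name) pairs. Those pairs are the "index" entries the error message refers to, not the Index column. Here is a minimal sketch, assuming a small made-up frame (the values and the repeated foo/x row are hypothetical, not taken from the question's data):

```python
import pandas as pd

# Hypothetical data: the (foo, x) pair appears twice, which is what breaks pivot().
df = pd.DataFrame({
    "Sample_Name": ["foo", "foo", "foo", "bar", "bar"],
    "Component_Name": ["x", "x", "z", "x", "y"],
    "Calculated_Concentration": [None, 3.5, 125.92766, 1.0, 2.0],
})

# Rows whose (Sample_Name, Component_Name) pair occurs more than once.
dupes = df[df.duplicated(subset=["Sample_Name", "Component_Name"], keep=False)]

# pivot() refuses to choose between the repeated rows and raises ValueError.
try:
    df.pivot(index="Sample_Name", columns="Component_Name",
             values="Calculated_Concentration")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# groupby().first() keeps the first non-null value per pair; unstack() then
# moves Component_Name from the row index into the columns.
wide = (df.groupby(["Sample_Name", "Component_Name"])
          ["Calculated_Concentration"]
          .first()
          .unstack())
```

Note that first() is still a choice between the duplicated rows, so it is worth inspecting dupes before relying on it.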
answered Oct 26 '22 03:10 by andrew_reece


You should be able to accomplish what you are looking to do with the pandas.pivot_table() function, which is described in the pandas documentation.

With your dataframe stored as df, use the following code:

import pandas as pd
df = pd.read_table('table_from_which_to_read')

new_df = pd.pivot_table(df, index=['Sample_Name'], columns='Component_Name', values='Calculated_Concentration')

If you want something other than the mean of the concentration value, you will need to change the aggfunc parameter.
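As a rough illustration of what changing aggfunc does, the sketch below compares the default mean with aggfunc="first" on a made-up frame. The data is hypothetical and contains one repeated (Sample_Name, Component_Name) pair:

```python
import pandas as pd

# Hypothetical data with a repeated (foo, x) pair.
df = pd.DataFrame({
    "Sample_Name": ["foo", "foo", "foo", "bar"],
    "Component_Name": ["x", "x", "z", "x"],
    "Calculated_Concentration": [10.0, 20.0, 125.92766, 1.0],
})

# Default aggfunc is the mean, so the two foo/x values are averaged.
mean_df = pd.pivot_table(df, index="Sample_Name", columns="Component_Name",
                         values="Calculated_Concentration")

# aggfunc="first" keeps the first value of each pair instead of averaging.
first_df = pd.pivot_table(df, index="Sample_Name", columns="Component_Name",
                          values="Calculated_Concentration", aggfunc="first")
```

When every pair is unique, both calls return the same table, because each group holds a single value.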

EDIT

Since you don't want to aggregate over the values, you can reshape the data by calling set_index() on your DataFrame (see the pandas documentation for DataFrame.set_index).

import pandas as pd
df = pd.DataFrame({'NonUniqueLabel': ['Item1', 'Item1', 'Item1', 'Item2'],
                   'SemiUniqueValue': ['X', 'Y', 'Z', 'X'],
                   'Value': [1.0, 100, 5, None]})

# Both names must match existing columns: the second one is 'SemiUniqueValue'.
new_df = df.set_index(['NonUniqueLabel', 'SemiUniqueValue'])

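The resulting table has a multi-index built from the two label columns. It is still in long form, so call unstack() on it if you need the second level as columns. That step only works when each pair of index values is unique.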
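If the goal is the wide layout from the question, the multi-index produced by set_index() can be unstacked. Here is a minimal sketch, using a toy frame like the one above (the names and values are illustrative only):

```python
import pandas as pd

# Toy frame: every (NonUniqueLabel, SemiUniqueValue) pair is unique.
df = pd.DataFrame({
    "NonUniqueLabel": ["Item1", "Item1", "Item1", "Item2"],
    "SemiUniqueValue": ["X", "Y", "Z", "X"],
    "Value": [1.0, 100, 5, None],
})

# Long form with a two-level index.
long_df = df.set_index(["NonUniqueLabel", "SemiUniqueValue"])

# Move the inner index level into the columns; no aggregation happens here.
wide = long_df["Value"].unstack()
```

If any pair were repeated, unstack() would raise the same "Index contains duplicate entries" ValueError as pivot(), so this route avoids aggregation only when the pairs are already unique.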

answered Oct 26 '22 02:10 by J-Eubanks