Using pandas 0.11 on python 2.7.3 I am trying to pivot a simple dataframe with the following values:
StudentID QuestionID Answer DateRecorded
0 1234 bar a 2012/01/21
1 1234 foo c 2012/01/22
2 4321 bop a 2012/01/22
3 5678 bar a 2012/01/24
4 8765 baz b 2012/02/13
5 4321 baz b 2012/02/15
6 8765 bop b 2012/02/16
7 5678 bop c 2012/03/15
8 5678 foo a 2012/04/01
9 1234 baz b 2012/04/11
10 8765 bar a 2012/05/03
11 4321 bar a 2012/05/04
12 5678 baz c 2012/06/01
13 1234 bar b 2012/11/01
I am using the following command:
df.pivot(index='StudentID', columns='QuestionID')
But I am getting the following error:
ReshapeError: Index contains duplicate entries, cannot reshape
Note that the same dataframe without the last line
13 1234 bar b 2012/11/01
The pivot results successfully in following:
Answer DateRecorded
QuestionID bar baz bop foo bar baz bop foo
StudentID
1234 a b NaN c 2012/01/21 2012/04/11 NaN 2012/01/22
4321 a b a NaN 2012/05/04 2012/02/15 2012/01/22 NaN
5678 a c c a 2012/01/24 2012/06/01 2012/03/15 2012/04/01
8765 a b b NaN 2012/05/03 2012/02/13 2012/02/16 NaN
I am new to pivoting and would like to know why having duplicate StudentID, QuestionID pair causing this problem? And, how can I fix this using the df.pivot() function?
thank you.
What do you expect your pivot table to look like with the duplicate entries? I'm not sure it would make sense to have multiple elements for (1234, bar) in the pivot table. Your data looks like it's naturally indexed by (questionID, studentID, dateRecorded).
If you go with the Hierarchical Index approach (they're really not that complicated!) I'd try:
In [104]: df2 = df.set_index(['StudentID', 'QuestionID', 'DateRecorded'])
In [105]: df2
Out[105]:
Answer
StudentID QuestionID DateRecorded
1234 bar 2012/01/21 a
foo 2012/01/22 c
4321 bop 2012/01/22 a
5678 bar 2012/01/24 a
8765 baz 2012/02/13 b
4321 baz 2012/02/15 b
8765 bop 2012/02/16 b
5678 bop 2012/03/15 c
foo 2012/04/01 a
1234 baz 2012/04/11 b
8765 bar 2012/05/03 a
4321 bar 2012/05/04 a
5678 baz 2012/06/01 c
1234 bar 2012/11/01 b
In [106]: df2.unstack('QuestionID')
Out[106]:
Answer
QuestionID bar baz bop foo
StudentID DateRecorded
1234 2012/01/21 a NaN NaN NaN
2012/01/22 NaN NaN NaN c
2012/04/11 NaN b NaN NaN
2012/11/01 b NaN NaN NaN
4321 2012/01/22 NaN NaN a NaN
2012/02/15 NaN b NaN NaN
2012/05/04 a NaN NaN NaN
5678 2012/01/24 a NaN NaN NaN
2012/03/15 NaN NaN c NaN
2012/04/01 NaN NaN NaN a
2012/06/01 NaN c NaN NaN
8765 2012/02/13 NaN b NaN NaN
2012/02/16 NaN NaN b NaN
2012/05/03 a NaN NaN NaN
Otherwise you can come up with some rule to determine which of the multiple entries to take for the pivot table, and avoid the Hierarchical index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With