Pivoting a Pandas Dataframe containing strings - 'No numeric types to aggregate' error

There are a good number of questions about this error, but after looking around I still can't find, or wrap my mind around, a solution. I'm trying to pivot a data frame of strings so that some row values become columns, but it isn't working so far.

Shape of my df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 515932 entries, 0 to 515931
Data columns (total 5 columns):
id                 515932 non-null object
cc_contact_id      515932 non-null object
Network_Name       515932 non-null object
question           515932 non-null object
response_answer    515932 non-null object
dtypes: object(5)
memory usage: 23.6+ MB

Sample format

id  contact_id  question    response_answer
16  137519  2206    State   Ca
17  137520  2206    State   Ca
18  137521  2206    State   Ca
19  137522  2206    State   Ca
20  137523  2208    City    Lancaster
21  137524  2208    City    Lancaster
22  137525  2208    City    Lancaster
23  137526  2208    City    Lancaster
24  137527  2208    Trip_End Location   Home
25  137528  2208    Trip_End Location   Home
26  137529  2208    Trip_End Location   Home
27  137530  2208    Trip_End Location   Home

What I would like to pivot to

    id      contact_id  State   City       Trip_End Location
16  137519  2206        Ca      None       None
20  137523  2208        None    Lancaster  None
24  137527  2208        None    None       Home
etc. etc. 

Here the question values become the columns, each response_answer goes in its corresponding column, and the ids are retained.

What I have tried

unified_df = pd.DataFrame(unified_data, columns=target_table_headers, dtype=object)

pivot_table = unified_df.pivot_table('response_answer',['id','cc_contact_id'],'question')
# OR
pivot_table = unified_df.pivot_table('response_answer','question')

DataError: No numeric types to aggregate

What is the way to pivot a data frame with string values?

jmhead asked Dec 23 '15 19:12



2 Answers

The default aggfunc in pivot_table is np.mean, which doesn't know what to do with strings, and you haven't properly indicated what the index should be. Try something like:

pivot_table = unified_df.pivot_table(index=['id', 'contact_id'],
                                     columns='question', 
                                     values='response_answer',
                                     aggfunc=lambda x: ' '.join(x))

This explicitly sets one row per (id, contact_id) pair and pivots the response_answer values on question. The aggfunc ensures that if the raw data contains multiple answers to the same question for a pair, they are concatenated with spaces. The syntax of pivot_table might vary depending on your pandas version.

Here's a quick example:

In [24]: import pandas as pd

In [25]: import random

In [26]: df = pd.DataFrame({'id':[100*random.randint(10, 50) for _ in range(100)], 'question': [str(random.randint(0,3)) for _ in range(100)], 'response': [str(random.randint(100,120)) for _ in range(100)]})

In [27]: df.head()
Out[27]:
     id question response
0  3100        1      116
1  4500        2      113
2  5000        1      120
3  3900        2      103
4  4300        0      117

In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
id          100 non-null int64
question    100 non-null object
response    100 non-null object
dtypes: int64(1), object(2)
memory usage: 3.1+ KB

In [29]: df.pivot_table(index='id', columns='question', values='response', aggfunc=lambda x: ' '.join(x)).head()
Out[29]:
question        0        1    2        3
id
1000      110 120      NaN  100      NaN
1100          NaN  106 108  104      NaN
1200      104 113      119  NaN      101
1300          102      NaN  116  108 120
1400          NaN      NaN  116      NaN
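
For a deterministic version of the same idea, here is a minimal sketch on a few rows shaped like the sample in the question. The frame `sample` and its values are made up for illustration (one id is given two City answers so the join is visible); it is not the asker's real data.

```python
import pandas as pd

# Hypothetical rows modelled on the question's sample, all stored as strings.
sample = pd.DataFrame({
    "id": ["137519", "137523", "137523", "137527"],
    "contact_id": ["2206", "2208", "2208", "2208"],
    "question": ["State", "City", "City", "Trip_End Location"],
    "response_answer": ["Ca", "Lancaster", "Palmdale", "Home"],
})

# A string-aware aggfunc replaces the numeric default.
wide = sample.pivot_table(
    index=["id", "contact_id"],
    columns="question",
    values="response_answer",
    aggfunc=lambda x: " ".join(x),
)

# Turn the index levels back into ordinary columns and drop the
# leftover "question" label on the column axis.
wide = wide.reset_index()
wide.columns.name = None
print(wide)
```

Questions that a given id never answered come out as NaN, and the two City answers for id 137523 are joined into a single string.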
cwharland answered Oct 07 '22 21:10


There are several ways.

1

df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate(lambda x: x).unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)

2

df1 = df.set_index(["id","contact_id","Network_Name","question"])['response_answer'].unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)

3

df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate('first').unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)

4

df1 = df.pivot_table(index=["id","contact_id","Network_Name"], columns='question', values=['response_answer'], aggfunc='first')
df1.columns = df1.columns.droplevel()
df1 = df1.reset_index()
df1.columns=df1.columns.tolist()
print (df1)

All four give the same result:

    id  contact_id  Network_Name       City State Trip_End_Location
0   16      137519          2206       None    Ca              None
1   17      137520          2206       None    Ca              None
2   18      137521          2206       None    Ca              None
3   19      137522          2206       None    Ca              None
4   20      137523          2208  Lancaster  None              None
5   21      137524          2208  Lancaster  None              None
6   22      137525          2208  Lancaster  None              None
7   23      137526          2208  Lancaster  None              None
8   24      137527          2208       None  None              Home
9   25      137528          2208       None  None              Home
10  26      137529          2208       None  None              Home
11  27      137530          2208       None  None              Home
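
One difference between these approaches only shows up when the same key combination occurs more than once. The sketch below uses a small made-up frame `dup` (hypothetical data with fewer key columns than the question's) to compare approach 2 with approach 3 in that case.

```python
import pandas as pd

# Hypothetical data: the ("1", "City") pair appears twice.
dup = pd.DataFrame({
    "id": ["1", "1", "2"],
    "question": ["City", "City", "State"],
    "response_answer": ["Lancaster", "Palmdale", "Ca"],
})

# Approach 2 (set_index + unstack) has no rule for choosing between
# duplicate entries, so pandas refuses to reshape.
try:
    dup.set_index(["id", "question"])["response_answer"].unstack()
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print("unstack failed:", err)

# Approach 3 (groupby + 'first') keeps the first answer per pair.
first = (
    dup.groupby(["id", "question"])["response_answer"]
    .aggregate("first")
    .unstack()
    .reset_index()
)
first.columns = first.columns.tolist()
print(first)
```

So if duplicates are possible, an explicit aggregation such as 'first' (or a join, as in the other answer) is the safer choice, while set_index with unstack is fine when every key combination is unique.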
johnInHome answered Oct 07 '22 20:10