I'm trying to take my dataframe from a long format, in which I have a column with a categorical variable, to a wide format in which each category has its own price column. Currently, my data looks like this:
date-time          date      vendor  payment_type  price
03-10-15 10:00:00  03-10-15  A1      1             50
03-10-15 10:00:00  03-10-15  A1      2             60
03-10-15 10:00:00  03-11-15  A1      1             45
03-10-15 10:00:00  03-11-15  A1      2             70
03-10-15 10:00:00  03-12-15  B1      1             40
03-10-15 10:00:00  03-12-15  B1      2             45
03-10-15 10:00:00  03-10-15  C1      1             60
03-10-15 10:00:00  03-10-15  C1      1             65
My goal is to have one price column per vendor and payment type combination, and one row per day. When there are multiple values per day, I want to use the maximum value. The end result should look something like this:
Date      A1_Pay1  A1_Pay2  ...  C1_Pay1  C1_Pay2
03-10-15  50       60       ...  65       NaN
03-11-15  45       70       ...  NaN      NaN
03-12-15  NaN      NaN      ...  NaN      NaN
I tried using unstack and pivot, but I either wasn't getting what I was going for, or was getting an error about Date not being a unique index.
Any ideas?
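For reference, here is a small sketch that rebuilds the sample data above (the date-time column is left out because it isn't used in the reshape; values are copied from the example, and the month-day-year date format is an assumption):

import pandas as pd

#rebuild the example long-format data shown above
df = pd.DataFrame({
    'date': ['03-10-15', '03-10-15', '03-11-15', '03-11-15',
             '03-12-15', '03-12-15', '03-10-15', '03-10-15'],
    'vendor': ['A1', 'A1', 'A1', 'A1', 'B1', 'B1', 'C1', 'C1'],
    'payment_type': [1, 2, 1, 2, 1, 2, 1, 1],
    'price': [50, 60, 45, 70, 40, 45, 60, 65]})
#parse the dates (assumed %m-%d-%y) so the pivoted result sorts chronologically
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')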
You can use pivot_table:
import pandas as pd

#convert column payment_type to string so it can be joined into the new column names
df['payment_type'] = df['payment_type'].astype(str)
#one row per date, one column per (vendor, payment_type), keeping the maximum price
df = pd.pivot_table(df, index='date', columns=['vendor', 'payment_type'], aggfunc=max)
#remove the top level ('price') of the column multiindex
df.columns = df.columns.droplevel(0)
#flatten the remaining (vendor, payment_type) levels into single column names
df.columns = ['_Pay'.join(col).strip() for col in df.columns.values]
print(df)
            A1_Pay1  A1_Pay2  B1_Pay1  B1_Pay2  C1_Pay1
date
2015-03-10  50       60       NaN      NaN      65
2015-03-11  45       70       NaN      NaN      NaN
2015-03-12  NaN      NaN      40       45       NaN
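Since you mentioned unstack: a groupby/max followed by unstack gives the same result and sidesteps the duplicate-index error, because the aggregation makes each (date, vendor, payment_type) combination unique before reshaping. A rough sketch, starting again from the original long-format dataframe (the variable name res is just for illustration):

#aggregate to one max price per (date, vendor, payment_type)
res = (df.groupby(['date', 'vendor', 'payment_type'])['price']
         .max()
         .unstack(['vendor', 'payment_type']))
#flatten the (vendor, payment_type) column multiindex the same way as above
res.columns = ['_Pay'.join(map(str, col)) for col in res.columns]
print(res)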
EDIT:
If you need other statistics, you can pass them as a list to aggfunc:
import numpy as np

#convert column payment_type to string so it can be joined into the new column names
df['payment_type'] = df['payment_type'].astype(str)
#compute several statistics at once by passing a list to aggfunc
df = pd.pivot_table(df, index='date', columns=['vendor', 'payment_type'],
                    aggfunc=[np.mean, np.max, np.median])
print(df)
              mean                          amax                          median
              price                         price                         price
vendor        A1    A1    B1    B1    C1    A1    A1    B1    B1    C1    A1    A1    B1    B1    C1
payment_type  1     2     1     2     1     1     2     1     2     1     1     2     1     2     1
date
2015-03-10    50    60    NaN   NaN   62.5  50    60    NaN   NaN   65    50    60    NaN   NaN   62.5
2015-03-11    45    70    NaN   NaN   NaN   45    70    NaN   NaN   NaN   45    70    NaN   NaN   NaN
2015-03-12    NaN   NaN   40    45    NaN   NaN   NaN   40    45    NaN   NaN   NaN   40    45    NaN
#remove the 'price' level of the multiindex (it is the second level once the aggfunc names are added)
df.columns = df.columns.droplevel(1)
#flatten the remaining (aggfunc, vendor, payment_type) levels into single column names
df.columns = ['_Pay'.join(col).strip() for col in df.columns.values]
print(df)
            mean_PayA1_Pay1  mean_PayA1_Pay2  mean_PayB1_Pay1  \
date
2015-03-10  50               60               NaN
2015-03-11  45               70               NaN
2015-03-12  NaN              NaN              40

            mean_PayB1_Pay2  mean_PayC1_Pay1  amax_PayA1_Pay1  \
date
2015-03-10  NaN              62.5             50
2015-03-11  NaN              NaN              45
2015-03-12  45               NaN              NaN

            amax_PayA1_Pay2  amax_PayB1_Pay1  amax_PayB1_Pay2  \
date
2015-03-10  60               NaN              NaN
2015-03-11  70               NaN              NaN
2015-03-12  NaN              40               45

            amax_PayC1_Pay1  median_PayA1_Pay1  median_PayA1_Pay2  \
date
2015-03-10  65               50                 60
2015-03-11  NaN              45                 70
2015-03-12  NaN              NaN                NaN

            median_PayB1_Pay1  median_PayB1_Pay2  median_PayC1_Pay1
date
2015-03-10  NaN                NaN                62.5
2015-03-11  NaN                NaN                NaN
2015-03-12  40                 45                 NaN
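Side note: on recent pandas versions, passing NumPy functions such as np.max to aggfunc may emit a FutureWarning, so the string names ('mean', 'max', 'median') are the safer spelling. A sketch of the same pivot with strings, again starting from the original long-format dataframe; note that the exact column levels can differ slightly between pandas versions, so inspect the columns before flattening:

#same pivot with string aggregation names; values='price' keeps other columns out of the result
stats = pd.pivot_table(df, index='date', values='price',
                       columns=['vendor', 'payment_type'],
                       aggfunc=['mean', 'max', 'median'])
#with a scalar values argument the columns are typically (aggfunc, vendor, payment_type)
stats.columns = ['_Pay'.join(map(str, col)) for col in stats.columns]
print(stats)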