I have a dataframe (df) that looks like this:
+---------+-------+------------+----------+ | subject | pills | date | strength | +---------+-------+------------+----------+ | 1 | 4 | 10/10/2012 | 250 | | 1 | 4 | 10/11/2012 | 250 | | 1 | 2 | 10/12/2012 | 500 | | 2 | 1 | 1/6/2014 | 1000 | | 2 | 1 | 1/7/2014 | 250 | | 2 | 1 | 1/7/2014 | 500 | | 2 | 3 | 1/8/2014 | 250 | +---------+-------+------------+----------+
When I use reshape in R, I get what I want:
reshape(df, idvar = c("subject","date"), timevar = 'strength', direction = "wide") +---------+------------+--------------+--------------+---------------+ | subject | date | strength.250 | strength.500 | strength.1000 | +---------+------------+--------------+--------------+---------------+ | 1 | 10/10/2012 | 4 | NA | NA | | 1 | 10/11/2012 | 4 | NA | NA | | 1 | 10/12/2012 | NA | 2 | NA | | 2 | 1/6/2014 | NA | NA | 1 | | 2 | 1/7/2014 | 1 | 1 | NA | | 2 | 1/8/2014 | 3 | NA | NA | +---------+------------+--------------+--------------+---------------+
Using pandas:
df.pivot_table(df, index=['subject','date'],columns='strength') +---------+------------+-------+----+-----+ | | | pills | +---------+------------+-------+----+-----+ | | strength | 250 | 500| 1000| +---------+------------+-------+----+-----+ | subject | date | | | | +---------+------------+-------+----+-----+ | 1 | 10/10/2012 | 4 | NA | NA | | | 10/11/2012 | 4 | NA | NA | | | 10/12/2012 | NA | 2 | NA | +---------+------------+-------+----+-----+ | 2 | 1/6/2014 | NA | NA | 1 | | | 1/7/2014 | 1 | 1 | NA | | | 1/8/2014 | 3 | NA | NA | +---------+------------+-------+----+-----+
How do I get exactly the same output as in R with pandas? I only want 1 header.
DataFrame - pivot_table() function The pivot_table() function is used to create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
Remove All Duplicate Rows from Pandas DataFrame You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .
Return a copy of the array collapsed into one dimension. Whether to flatten in C (row-major), Fortran (column-major) order, or preserve the C/Fortran ordering from a . The default is 'C'.
After pivoting, convert the dataframe to records and then back to dataframe:
flattened = pd.DataFrame(pivoted.to_records()) # subject date ('pills', 250) ('pills', 500) ('pills', 1000) #0 1 10/10/2012 4.0 NaN NaN #1 1 10/11/2012 4.0 NaN NaN #2 1 10/12/2012 NaN 2.0 NaN #3 2 1/6/2014 NaN NaN 1.0 #4 2 1/7/2014 1.0 1.0 NaN #5 2 1/8/2014 3.0 NaN NaN
You can now "repair" the column names, if you want:
flattened.columns = [hdr.replace("('pills', ", "strength.").replace(")", "") \ for hdr in flattened.columns] flattened # subject date strength.250 strength.500 strength.1000 #0 1 10/10/2012 4.0 NaN NaN #1 1 10/11/2012 4.0 NaN NaN #2 1 10/12/2012 NaN 2.0 NaN #3 2 1/6/2014 NaN NaN 1.0 #4 2 1/7/2014 1.0 1.0 NaN #5 2 1/8/2014 3.0 NaN NaN
It's awkward, but it works.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With