I have a data frame that looks like this:
USUBJID IDVAR IDVARVAL QNAM QVAL
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
5 Dummy-01-0001 AESPID 1 dummy2AE Gastrointestinal disorders
6 Dummy-01-0002 AESPID 1 dummy2AE Nervous system disorder
7 Dummy-01-0004 AESPID 2 dummy2AE Gastrointestinal disorders
8 Dummy-01-0004 AESPID 1 dummy2AE Nervous system disorder
9 Dummy-01-0005 AESPID 1 dummy2AE Gastrointestinal disorders
The above dataframe is just a sample; please use the following code to load some more data into the dataframe:
df = pd.DataFrame({'USUBJID': {0: 'Dummy-01-0001', 1: 'Dummy-01-0002', 2: 'Dummy-01-0004', 3: 'Dummy-01-0004', 4: 'Dummy-01-0005', 5: 'Dummy-01-0007', 6: 'Dummy-01-0008', 7: 'Dummy-01-0008', 8: 'Dummy-01-0010', 9: 'Dummy-01-0010', 10: 'Dummy-01-0011', 11: 'Dummy-01-0011', 12: 'Dummy-01-0013', 13: 'Dummy-01-0013', 14: 'Dummy-01-0014', 15: 'Dummy-01-0016', 16: 'Dummy-01-0016', 17: 'Dummy-01-0019', 18: 'Dummy-01-0001', 19: 'Dummy-01-0002', 20: 'Dummy-01-0004', 21: 'Dummy-01-0004', 22: 'Dummy-01-0005', 23: 'Dummy-01-0007', 24: 'Dummy-01-0008', 25: 'Dummy-01-0008', 26: 'Dummy-01-0010', 27: 'Dummy-01-0010', 28: 'Dummy-01-0011', 29: 'Dummy-01-0011', 30: 'Dummy-01-0013', 31: 'Dummy-01-0013', 32: 'Dummy-01-0014', 33: 'Dummy-01-0016', 34: 'Dummy-01-0016', 35: 'Dummy-01-0017', 36: 'Dummy-01-0017', 37: 'Dummy-01-0019'}, 'IDVAR': {0: 'AESEQ', 1: 'AESEQ', 2: 'AESEQ', 3: 'AESEQ', 4: 'AESEQ', 5: 'AESEQ', 6: 'AESEQ', 7: 'AESEQ', 8: 'AESEQ', 9: 'AESEQ', 10: 'AESEQ', 11: 'AESEQ', 12: 'AESEQ', 13: 'AESEQ', 14: 'AESEQ', 15: 'AESEQ', 16: 'AESEQ', 17: 'AESEQ', 18: 'AESPID', 19: 'AESPID', 20: 'AESPID', 21: 'AESPID', 22: 'AESPID', 23: 'AESPID', 24: 'AESPID', 25: 'AESPID', 26: 'AESPID', 27: 'AESPID', 28: 'AESPID', 29: 'AESPID', 30: 'AESPID', 31: 'AESPID', 32: 'AESPID', 33: 'AESPID', 34: 'AESPID', 35: 'AESPID', 36: 'AESPID', 37: 'AESPID'}, 'IDVARVAL': {0: '1.0', 1: '1.0', 2: '1.0', 3: '2.0', 4: '1.0', 5: '1.0', 6: '1.0', 7: '2.0', 8: '1.0', 9: '2.0', 10: '1.0', 11: '2.0', 12: '1.0', 13: '2.0', 14: '1.0', 15: '1.0', 16: '2.0', 17: '1.0', 18: '1', 19: '1', 20: '2', 21: '1', 22: '1', 23: '1', 24: '1', 25: '2', 26: '2', 27: '1', 28: '1', 29: '2', 30: '1', 31: '2', 32: '1', 33: '1', 34: '2', 35: '1', 36: '2', 37: '1'}, 'QNAM': {0: 'AEdummy1', 1: 'AEdummy1', 2: 'AEdummy1', 3: 'AEdummy1', 4: 'AEdummy1', 5: 'AEdummy1', 6: 'AEdummy1', 7: 'AEdummy1', 8: 'AEdummy1', 9: 'AEdummy1', 10: 'AEdummy1', 11: 'AEdummy1', 12: 'AEdummy1', 13: 'AEdummy1', 14: 'AEdummy1', 15: 'AEdummy1', 16: 'AEdummy1', 17: 'AEdummy1', 18: 'dummy2AE', 19: 'dummy2AE', 20: 'dummy2AE', 21: 'dummy2AE', 22: 'dummy2AE', 23: 'dummy2AE', 24: 'dummy2AE', 25: 'dummy2AE', 26: 'dummy2AE', 27: 'dummy2AE', 28: 'dummy2AE', 29: 'dummy2AE', 30: 'dummy2AE', 31: 'dummy2AE', 32: 'dummy2AE', 33: 'dummy2AE', 34: 'dummy2AE', 35: 'dummy2AE', 36: 'dummy2AE', 37: 'dummy2AE'}, 'QVAL': {0: '2012-02-15', 1: '2012-02-23', 2: '2012-02-06', 3: '2012-03-10', 4: '2012-03-10', 5: '2012-02-08', 6: '2012-03-18', 7: '2012-03-07', 8: '2012-02-01', 9: '2012-01-10', 10: '2012-01-19', 11: '2012-03-28', 12: '2012-02-19', 13: '2012-02-14', 14: '2012-03-13', 15: '2012-03-08', 16: '2012-02-05', 17: '2012-03-18', 18: 'Gastrointestinal disorders', 19: 'Nervous system disorder', 20: 'Gastrointestinal disorders', 21: 'Nervous system disorder', 22: 'Gastrointestinal disorders', 23: 'Vascular disorders', 24: 'Gastrointestinal disorders', 25: 'Vascular disorders', 26: 'Nervous system disorder', 27: 'Gastrointestinal disorders', 28: 'Nervous system disorder', 29: 'Nervous system disorder', 30: 'Nervous system disorder', 31: 'Gastrointestinal disorders', 32: 'Gastrointestinal disorders', 33: 'Vascular disorders', 34: 'Gastrointestinal disorders', 35: 'Nervous system disorder', 36: 'Nervous system disorder', 37: 'Vascular disorders'}})
Let's understand the structure a bit: IDVAR holds a variable name and IDVARVAL holds its value, while QNAM holds another variable name and QVAL holds the corresponding value. For example, the first row says that for subject Dummy-01-0001 the record with AESEQ = 1.0 has AEdummy1 = 2012-02-15. In the domain I'm working in, this is called a normalized structure; I know it's weird, though.
I want to get this data in the following form:
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1 2012-03-10 Gastrointestinal disorders
If I had to do it for QNAM and QVAL only, there would be no problem; I can easily do it using pandas' pivot as follows:
df.pivot(
    index=['USUBJID', 'IDVARVAL', 'IDVAR'],
    columns='QNAM',
    values='QVAL'
).reset_index()
Even for this problem I could use some sort of masking, but I know that is not going to be the most efficient approach, since this data will have thousands or sometimes millions of records.
Please note that the combination of USUBJID and IDVARVAL for an IDVAR should map to the value of QVAL for a QNAM. That said, some of the records for AESEQ, AESPID, AEdummy1, and dummy2AE could be NaN or empty in the output DataFrame. In other words, for the above sample, Dummy-01-0001 and AESEQ = 1.0 uniquely identify AEdummy1 = 2012-02-15.
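If you want to check that assumption on your real data before reshaping, a quick duplicate test is enough (a hypothetical sanity check, not something the solutions below require):
# Each (USUBJID, IDVAR, IDVARVAL, QNAM) combination should appear at most once;
# otherwise a single cell of the wide output would have to hold more than one QVAL.
assert not df.duplicated(subset=['USUBJID', 'IDVAR', 'IDVARVAL', 'QNAM']).any()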
Let us try unstacking followed by concat:
# A counter per IDVAR group plus USUBJID becomes the shared index
s = df.set_index([df.groupby('IDVAR').cumcount(), 'USUBJID'])
# Spread IDVAR/IDVARVAL and QNAM/QVAL into their own wide frames
s1 = s.set_index('IDVAR', append=True)['IDVARVAL'].unstack()
s2 = s.set_index('QNAM', append=True)['QVAL'].unstack()
# Align the two halves on the shared index and bring USUBJID back as a column
out = pd.concat([s1, s2], axis=1).reset_index(level=1)
Group the dataframe on IDVAR and use cumcount to create a sequential counter that uniquely identifies the rows within each IDVAR group, then set this counter along with the column USUBJID as the index of the dataframe. This shared (counter, USUBJID) index is what later lets concat align the two unstacked halves:
>>> s
IDVAR IDVARVAL QNAM QVAL
USUBJID
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
0 Dummy-01-0001 AESPID 1.0 dummy2AE Gastrointestinal disorders
1 Dummy-01-0002 AESPID 1.0 dummy2AE Nervous system disorder
2 Dummy-01-0004 AESPID 2.0 dummy2AE Gastrointestinal disorders
3 Dummy-01-0004 AESPID 1.0 dummy2AE Nervous system disorder
4 Dummy-01-0005 AESPID 1.0 dummy2AE Gastrointestinal disorders
Now, for each of the columns IDVARVAL and QVAL, update the index by appending the corresponding column (IDVAR and QNAM, respectively), then unstack to reshape:
>>> s1
IDVAR AESEQ AESPID
USUBJID
0 Dummy-01-0001 1.0 1.0
1 Dummy-01-0002 1.0 1.0
2 Dummy-01-0004 1.0 2.0
3 Dummy-01-0004 2.0 1.0
4 Dummy-01-0005 1.0 1.0
>>> s2
QNAM AEdummy1 dummy2AE
USUBJID
0 Dummy-01-0001 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 2012-02-23 Nervous system disorder
2 Dummy-01-0004 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2012-03-10 Nervous system disorder
4 Dummy-01-0005 2012-03-10 Gastrointestinal disorders
Finally, concat the above unstacked frames s1 and s2 along axis=1 to get the desired result:
>>> out
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders
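Optionally, since the normalized frame stores everything as strings, you could drop the leftover counter index and restore proper dtypes; this is an assumed preference rather than part of the question's requirements:
out = out.reset_index(drop=True)                   # plain 0..N index instead of the counter
out['AESEQ'] = pd.to_numeric(out['AESEQ'])         # '1.0' -> 1.0
out['AESPID'] = pd.to_numeric(out['AESPID'])       # '1' -> 1
out['AEdummy1'] = pd.to_datetime(out['AEdummy1'])  # '2012-02-15' -> Timestamp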
You could reshape the data with pivot_longer from pyjanitor: the list passed to names_pattern lumps the values of the columns matching the first regex (IDVAR, QNAM) into variables and those matching the second (IDVARVAL, QVAL) into values. Then create a unique index with a cumcount counter, pivot the data, and do some final cleanup to get something similar to your expected output:
(
    # pivot_longer is registered on DataFrames by importing pyjanitor: import janitor
    df.pivot_longer(
        ["USUBJID"],
        names_pattern=["IDVAR$|QNAM$", "IDVARVAL$|QVAL$"],
        names_to=["variables", "values"])
    .assign(counter=lambda df: df.groupby(["USUBJID", "variables"]).cumcount())
    # pivot's arguments are keyword-only in recent pandas versions
    .pivot(index=["USUBJID", "counter"], columns="variables", values="values")
    .droplevel("counter")
    .rename_axis(columns=None)
    .reset_index()
)
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders
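If you prefer to stay in plain pandas, pd.lreshape collapses the IDVAR/IDVARVAL and QNAM/QVAL pairs the same way the names_pattern list does; the rest mirrors the counter-and-pivot steps above (a sketch of an equivalent, not the answer's own code):
# Collapse the two name/value column pairs into long form, keeping USUBJID
long_df = pd.lreshape(df, {'variables': ['IDVAR', 'QNAM'],
                           'values': ['IDVARVAL', 'QVAL']})
# A counter per (USUBJID, variables) group makes the index unique for pivot
long_df['counter'] = long_df.groupby(['USUBJID', 'variables']).cumcount()
out = (long_df
       .pivot(index=['USUBJID', 'counter'], columns='variables', values='values')
       .droplevel('counter')
       .rename_axis(columns=None)
       .reset_index())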