I currently have a data set that tracks the completion of 5 tests. However, it only shows those who have completed a test, not those who have yet to take it. Example below:
Name Test Completed
John Math-Test1 Yes
John Math-Test2 Yes
John Math-Test3 Yes
John Math-Test4 Yes
John Math-Test5 Yes
Lauren Math-Test1 Yes
Lauren Math-Test2 Yes
Lauren Math-Test3 Yes
Tom Math-Test1 Yes
Tom Math-Test2 Yes
Tom Math-Test3 Yes
Tom Math-Test4 Yes
Tom Math-Test5 Yes
As you can see, Lauren has not yet taken the tests 'Math-Test4' and 'Math-Test5', so her name doesn't appear. I would like to add an option to have the 'Completed' column say 'No' when someone has not completed a test.
Desired output is below:
Name Test Completed
John Math-Test1 Yes
John Math-Test2 Yes
John Math-Test3 Yes
John Math-Test4 Yes
John Math-Test5 Yes
Lauren Math-Test1 Yes
Lauren Math-Test2 Yes
Lauren Math-Test3 Yes
*Lauren Math-Test4 No* - Add these rows automatically
*Lauren Math-Test5 No*
Tom Math-Test1 Yes
Tom Math-Test2 Yes
Tom Math-Test3 Yes
Tom Math-Test4 Yes
Tom Math-Test5 Yes
How could this be achieved with Python/Pandas/Numpy?
Thanks for all who can assist!
Edit - Update: Upon trying @Scott Boston's code I get this output:
idx = pd.MultiIndex.from_product([df['Name'].unique(),
                                  df['Test'].unique()],
                                 names=['Name','Test'])
newidx = idx[~idx.isin(df.set_index(['Name','Test']).index)]
pd.concat([df,
           newidx.to_series().reset_index().assign(Completed="No*")[['Name','Test','Completed']]],
          ignore_index=True)
Output:
Name1 Test Completed
John Math-Test1 Yes
John Math-Test2 Yes
John Math-Test3 Yes
John Math-Test4 Yes
John Math-Test5 Yes
Lauren Math-Test1 Yes
Lauren Math-Test2 Yes
Lauren Math-Test3 Yes
Tom Math-Test1 Yes
Tom Math-Test2 Yes
Tom Math-Test3 Yes
Tom Math-Test4 Yes
Tom Math-Test5 Yes
John Math-Test3 No*
John Math-Test4 No*
John Math-Test5 No*
John Math-Test2 No*
Lauren Math-Test3 No*
Lauren Math-Test4 No*
Lauren Math-Test5 No*
Lauren Math-Test2 No*
Lauren Math-Test5 No*
Lauren Math-Test1 No*
Lauren Math-Test2 No*
Lauren Math-Test4 No*
Lauren Math-Test5 No*
Now I just need to find a way to remove the unwanted rows to get the desired output.
Try this: let's use a MultiIndex built with from_product, together with set_index and reindex.
This method works for all "seen" values. If a value isn't seen anywhere in the data, you'll need to pass a hardcoded list to from_product instead:
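import pandas as pd

# Every (Name, Test) combination that appears in the data
idx = pd.MultiIndex.from_product([df['Name'].unique(),
                                  df['Test'].unique()],
                                 names=['Name','Test'])
# Reindex onto the full grid; pairs missing from df get 'No*'
df.set_index(['Name','Test']).reindex(idx, fill_value='No*').reset_index()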
Output:
Name Test Completed
0 John Math-Test1 Yes
1 John Math-Test2 Yes
2 John Math-Test3 Yes
3 John Math-Test4 Yes
4 John Math-Test5 Yes
5 Lauren Math-Test1 Yes
6 Lauren Math-Test2 Yes
7 Lauren Math-Test3 Yes
8 Lauren Math-Test4 No*
9 Lauren Math-Test5 No*
10 Tom Math-Test1 Yes
11 Tom Math-Test2 Yes
12 Tom Math-Test3 Yes
13 Tom Math-Test4 Yes
14 Tom Math-Test5 Yes
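The caveat about unseen values can be sketched as follows. This is a minimal example under assumptions: the sample frame and the `all_tests` list are made up for illustration, and it assumes each (Name, Test) pair occurs at most once, since reindex raises on a duplicated index.

```python
import pandas as pd

# Hypothetical sample: nobody has taken Math-Test3..5 yet,
# so those tests never show up in df['Test'].unique().
df = pd.DataFrame({
    "Name": ["John", "John", "Lauren"],
    "Test": ["Math-Test1", "Math-Test2", "Math-Test1"],
    "Completed": ["Yes", "Yes", "Yes"],
})

# Hardcoded list of every test that should appear (an assumption for this sketch)
all_tests = [f"Math-Test{i}" for i in range(1, 6)]

# Full grid of names x tests, using the hardcoded list instead of the seen values
idx = pd.MultiIndex.from_product([df["Name"].unique(), all_tests],
                                 names=["Name", "Test"])

# Pairs missing from df are filled with 'No'
result = (df.set_index(["Name", "Test"])
            .reindex(idx, fill_value="No")
            .reset_index())
print(result)
```

The same idea applies to names: if someone has taken no test at all, their name would need to come from a separate list too.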
Update
idx = pd.MultiIndex.from_product([df['Name'].unique(),
                                  df['Test'].unique()],
                                 names=['Name','Test'])
newidx = idx[~idx.isin(df.set_index(['Name','Test']).index)]
pd.concat([df,
           newidx.to_series().reset_index().assign(Completed="No*")[['Name','Test','Completed']]],
          sort=True, ignore_index=True)
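For reference, here is a self-contained sketch of the concat approach run against the data from the question. It is a variation rather than the exact code above: it swaps in MultiIndex.from_frame and to_frame(index=False) to build the missing rows, uses 'No' as the fill value, and adds a sort_values step (my addition) so the new rows land next to the existing ones for each name.

```python
import pandas as pd

# Rebuild the question's data: Lauren has only taken the first three tests
tests = [f"Math-Test{i}" for i in range(1, 6)]
df = pd.DataFrame({
    "Name": ["John"] * 5 + ["Lauren"] * 3 + ["Tom"] * 5,
    "Test": tests + tests[:3] + tests,
    "Completed": "Yes",
})

# Every (Name, Test) combination seen in the data
idx = pd.MultiIndex.from_product([df["Name"].unique(), df["Test"].unique()],
                                 names=["Name", "Test"])

# Keep only the combinations that do not already exist in df
existing = pd.MultiIndex.from_frame(df[["Name", "Test"]])
missing = idx[~idx.isin(existing)]

# Turn the missing pairs into rows marked 'No' and append them
missing_rows = missing.to_frame(index=False).assign(Completed="No")
result = (pd.concat([df, missing_rows], ignore_index=True)
            .sort_values(["Name", "Test"])
            .reset_index(drop=True))
print(result)
```

If this produces extra 'No' rows for pairs that look like they already exist, it may be worth checking that the column names and string values in df match exactly (for example, no stray whitespace), since isin compares the values literally.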