Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unmelt only part of a column from pandas dataframe

I have the following example dataframe:

df = pd.DataFrame(data = {'RecordID' : [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5], 'DisplayLabel' : ['Source','Test','Value 1','Value 2','Value3','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2'],
'Value' : ['Web','Logic','S','I','Complete','Person','Voice','>20','P','Mail','OCR','A','I','Dictation','Understandable','S','I','Web','Logic','R','S']})

which creates this dataframe:

+-------+----------+---------------+----------------+
| Index | RecordID | Display Label |     Value      |
+-------+----------+---------------+----------------+
|     0 |        1 | Source        | Web            |
|     1 |        1 | Test          | Logic          |
|     2 |        1 | Value 1       | S              |
|     3 |        1 | Value 2       | I              |
|     4 |        1 | Value 3       | Complete       |
|     5 |        2 | Source        | Person         |
|     6 |        2 | Test          | Voice          |
|     7 |        2 | Value 1       | >20            |
|     8 |        2 | Value 2       | P              |
|     9 |        3 | Source        | Mail           |
|    10 |        3 | Test          | OCR            |
|    11 |        3 | Value 1       | A              |
|    12 |        3 | Value 2       | I              |
|    13 |        4 | Source        | Dictation      |
|    14 |        4 | Test          | Understandable |
|    15 |        4 | Value 1       | S              |
|    16 |        4 | Value 2       | I              |
|    17 |        5 | Source        | Web            |
|    18 |        5 | Test          | Logic          |
|    19 |        5 | Value 1       | R              |
|    20 |        5 | Value 2       | S              |
+-------+----------+---------------+----------------+

I am trying to "unmelt" though not exactly the source and test columns into new dataframe Columns such that it will look like this:

+-------+----------+-----------+----------------+---------------+----------+
| Index | RecordID |  Source   |      Test      | Result        |  Value   |
+-------+----------+-----------+----------------+---------------+----------+
|     0 |        1 | Web       | Logic          | Value 1       | S        |
|     1 |        1 | Web       | Logic          | Value 2       | I        |
|     2 |        1 | Web       | Logic          | Value 3       | Complete |
|     3 |        2 | Person    | Voice          | Value 1       | >20      |
|     4 |        2 | Person    | Voice          | Value 2       | P        |
|     5 |        3 | Mail      | OCR            | Value 1       | A        |
|     6 |        3 | Mail      | OCR            | Value 2       | I        |
|     7 |        4 | Dictation | Understandable | Value 1       | S        |
|     8 |        4 | Dictation | Understandable | Value 2       | I        |
|     9 |        5 | Web       | Logic          | Value 1       | R        |
|    10 |        5 | Web       | Logic          | Value 2       | S        |
+-------+----------+-----------+----------------+---------------+----------+

It's my understanding that pivot and melt will do the entire DisplayLabel column and not just some of the values.

Any help would be greatly appreciated as I have read the Pandas Melt and the Pandas Pivot as well as some references on stackoverflow and I can't seem to figure out a way to do this quickly.

Thanks!

like image 793
Jon Avatar asked Jan 21 '20 18:01

Jon


1 Answers

We can achieve your result by applying logic and pivotting, we split your data by checking if DisplayLabel contains Value and then we join them back together:

mask = df['DisplayLabel'].str.contains('Value')
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')

dfpiv = (
    df[mask].rename(columns={'DisplayLabel':'Result'})
            .set_index('RecordID')
            .join(df2)
            .reset_index()
)
    RecordID   Result     Value     Source            Test
0          1  Value 1         S        Web           Logic
1          1  Value 2         I        Web           Logic
2          1   Value3  Complete        Web           Logic
3          2  Value 1       >20     Person           Voice
4          2  Value 2         P     Person           Voice
5          3  Value 1         A       Mail             OCR
6          3  Value 2         I       Mail             OCR
7          4  Value 1         S  Dictation  Understandable
8          4  Value 2         I  Dictation  Understandable
9          5  Value 1         R        Web           Logic
10         5  Value 2         S        Web           Logic

If you want the exact column order as your example, use DataFrame.reindex:

dfpiv.reindex(columns=['RecordID', 'Source', 'Test', 'Result', 'Value'])

    RecordID     Source            Test   Result     Value
0          1        Web           Logic  Value 1         S
1          1        Web           Logic  Value 2         I
2          1        Web           Logic   Value3  Complete
3          2     Person           Voice  Value 1       >20
4          2     Person           Voice  Value 2         P
5          3       Mail             OCR  Value 1         A
6          3       Mail             OCR  Value 2         I
7          4  Dictation  Understandable  Value 1         S
8          4  Dictation  Understandable  Value 2         I
9          5        Web           Logic  Value 1         R
10         5        Web           Logic  Value 2         S

In detail - step by step:

# mask all rows where "Value" is in column DisplayLabel
mask = df['DisplayLabel'].str.contains('Value')

0     False
1     False
2      True
3      True
4      True
5     False
6     False
7      True
8      True
9     False
10    False
11     True
12     True
13    False
14    False
15     True
16     True
17    False
18    False
19     True
20     True
Name: DisplayLabel, dtype: bool
# select all rows which do NOT have "Value" in DisplayLabel
df[~mask]

    RecordID DisplayLabel           Value
0          1       Source             Web
1          1         Test           Logic
5          2       Source          Person
6          2         Test           Voice
9          3       Source            Mail
10         3         Test             OCR
13         4       Source       Dictation
14         4         Test  Understandable
17         5       Source             Web
18         5         Test           Logic
# pivot the values in DisplayLabel to columns
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')

DisplayLabel     Source            Test
RecordID                               
1                   Web           Logic
2                Person           Voice
3                  Mail             OCR
4             Dictation  Understandable
5                   Web           Logic
df[mask].rename(columns={'DisplayLabel':'Result'}) # rename the column DisplayLabel to Result
            .set_index('RecordID')                 # set RecordId as index so we can join df2 
            .join(df2)                             # join df2 back to our dataframe based RecordId
            .reset_index()                         # reset index so we get RecordId back as column

    RecordID   Result     Value     Source            Test
0          1  Value 1         S        Web           Logic
1          1  Value 2         I        Web           Logic
2          1   Value3  Complete        Web           Logic
3          2  Value 1       >20     Person           Voice
4          2  Value 2         P     Person           Voice
5          3  Value 1         A       Mail             OCR
6          3  Value 2         I       Mail             OCR
7          4  Value 1         S  Dictation  Understandable
8          4  Value 2         I  Dictation  Understandable
9          5  Value 1         R        Web           Logic
10         5  Value 2         S        Web           Logic
like image 107
Erfan Avatar answered Oct 27 '22 13:10

Erfan