Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Have Pandas column containing lists, how to pivot unique list elements to columns?

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attributes describing the product. I want to create a column in the dataframe for every unique attribute and populate the row in that column with the attribute's substring. Example df below.

PRODUCTS     DATE        DESCRIPTION
Product A    2016-9-12   Steel, Red, High Hardness
Product B    2016-9-11   Blue, Lightweight, Steel
Product C    2016-9-12   Red

I figure the first step is to split the description into a list.

In: df2 = df['DESCRIPTION'].str.split(',')

Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']

My desired output looks like the table below. The column names are not particularly important.

PRODUCTS     DATE        STEEL_COL  RED_COL    HIGH HARDNESS_COL  BLUE COL   LIGHTWEIGHT_COL
Product A    2016-9-12   Steel      Red        High Hardness
Product B    2016-9-11   Steel                                    Blue       Lightweight
Product C    2016-9-12              Red

I believe the columns can be set up using a Pivot but I'm not sure the most Pythonic way to populate the columns after establishing them. Any help is appreciated.

UPDATE

Thank you very much for the answers. I selected @MaxU's response as correct since it seems slightly more flexible, but @piRSquared's gets a very similar result and may even be considered the more Pythonic approach. I tested both version and both do what I needed. Thanks!

like image 961
C. Hale Avatar asked Sep 12 '16 21:09

C. Hale


1 Answers

you can build up a sparse matrix:

In [27]: df
Out[27]:
    PRODUCTS       DATE                DESCRIPTION
0  Product A  2016-9-12  Steel, Red, High Hardness
1  Product B  2016-9-11   Blue, Lightweight, Steel
2  Product C  2016-9-12                        Red

In [28]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
   ....: )
Out[28]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12     0              1            0    1      1
Product B 2016-9-11     1              0            1    0      1
Product C 2016-9-12     0              0            0    1      0

In [29]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
   ....: )
Out[29]:
0                   Blue High Hardness Lightweight Red Steel
PRODUCTS  DATE
Product A 2016-9-12                  1               1     1
Product B 2016-9-11    1                         1         1
Product C 2016-9-12                                  1
like image 93
MaxU - stop WAR against UA Avatar answered Oct 13 '22 08:10

MaxU - stop WAR against UA