
Python: Extract dimension data from dataframe string column and create columns with values for each of them

Hi,

I have a source file with 2 columns: ID and all_dimensions. The all_dimensions column is a string of "key-value" pairs, and the keys are not the same for each ID. I want to make the keys column headers and put each value, if present, in the matching cell.

Example:

ID  all_dimensions
12  Height:2 cm,Volume: 4cl,Weight:100g
34  Length: 10cm, Height: 5 cm
56  Depth: 80cm
78  Weight: 2 kg, Length: 7 cm
90  Diameter: 4 cm, Volume: 50 cl

Desired result:

ID  Height  Volume  Weight  Length  Depth  Diameter 
12  2 cm     4cl     100g      -      -        -
34  5 cm      -        -     10cm     -        -
56    -       -        -      -      80cm      -
78    -       -      2 kg    7 cm     -        -
90    -     50 cl     -       -      -        4 cm

I have over 100 dimensions, so ideally I would like to write a for loop or something similar instead of specifying each column header by hand (see code examples below). I am using Python 3.7.3 and pandas 0.24.2.

What I have tried already:

1) I tried to split the data into separate columns, but wasn't sure how to proceed so that each value is assigned to the right header:

df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)

2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):

df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')

3) To resolve the problem in 2) of having to name each header, I created a list of all dimension attributes and thought I could use that list in a for loop to extract the values:

columns_list=df.all_dimensions.str.extract(r'^([\D]*):',expand=True).drop_duplicates()
columns_list=columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension=df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')

Here, Jupyter Notebook gives me a UserWarning: "Pandas doesn't allow columns to be created via a new attribute name", and the df looks the same as before.
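For reference, the warning in attempt 3 appears because `df.dimension = ...` sets an attribute literally called `dimension` instead of a column named after the loop variable, and the pattern `r'dimension(...)'` looks for the literal text "dimension". The following is only a sketch of how the loop might be repaired, assuming the sample data above. The bracket assignment, `re.escape`, the `keys` list and the exact regular expressions are illustrative choices, not part of the original attempt. It also collects every key per row, since `^([\D]*):` only captures the first one.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})

# Collect every key (the text before a colon) in order of first appearance.
keys = []
for cell in df["all_dimensions"]:
    for key in re.findall(r"([^,:]+):", cell):
        key = key.strip()
        if key not in keys:
            keys.append(key)

for dimension in keys:
    # Build the pattern from the loop variable; a plain r'dimension...' string
    # would search for the literal word "dimension".
    pattern = r"(?:^|,)\s*" + re.escape(dimension) + r"\s*:\s*([^,]*)"
    # Bracket assignment creates a real column; df.dimension = ... does not.
    df[dimension] = (df["all_dimensions"]
                     .str.extract(pattern, expand=False)
                     .str.strip())
```

Rows that lack a dimension end up as NaN in that column; `df.fillna('')` or `df.fillna('-')` could be applied afterwards if placeholders are preferred.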

Annina asked Jun 03 '19 15:06


1 Answer

Option 1: I prefer splitting in several steps:

new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
             )

# split second time for individual measurement
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
                    )

# stripping off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()

# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()

Option 2: Use split(',|:') as you tried:

# splitting
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',|:', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
             )

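# concat along axis=1 to get a dataframe with two columns:
# even positions hold the measurement names, odd positions the values.
# After reset_index, new_df.columns = ('ID', 0, 1) where 0 is the measurement name
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2].str.strip()), axis=1)
            .reset_index())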

new_df.set_index(['ID', 0])[1].unstack()

Output:

    Depth  Diameter  Height  Length  Volume  Weight
ID
12    NaN       NaN    2 cm     NaN     4cl    100g
34    NaN       NaN    5 cm    10cm     NaN     NaN
56   80cm       NaN     NaN     NaN     NaN     NaN
78    NaN       NaN     NaN    7 cm     NaN    2 kg
90    NaN      4 cm     NaN     NaN   50 cl     NaN
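
To try this against the question's sample data, here is a self-contained sketch that follows the steps of Option 1. The variable names (`pairs`, `kv`, `wide`, `result`) are made up for illustration, and the `.dropna()` and `.fillna('-')` calls are additions: the first is a guard in case `stack()` keeps the padding that `expand=True` introduces (which I believe newer pandas versions may do), the second only mimics the dashes in the desired result.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})

# First split: one row per "key: value" pair, still indexed by ID.
pairs = (df.set_index("ID")["all_dimensions"]
           .str.split(",", expand=True)
           .stack()
           .dropna()  # added guard: discard padding from expand=True
           .reset_index(level=-1, drop=True))

# Second split: key into column 0, value into column 1, both trimmed.
kv = pairs.str.split(":", expand=True)
kv[0] = kv[0].str.strip()
kv[1] = kv[1].str.strip()

# Pivot the keys into column headers (columns come out sorted by name).
wide = kv.reset_index().set_index(["ID", 0])[1].unstack()

# Optional: show "-" instead of NaN, as in the desired result.
result = wide.fillna("-")
print(result)
```

The columns are sorted alphabetically by `unstack()`; something like `result[['Height', 'Volume', 'Weight', 'Length', 'Depth', 'Diameter']]` would restore the order shown in the question.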

Quang Hoang answered Oct 02 '22 14:10