I have a source file with two columns: ID and all_dimensions. all_dimensions is a string of "key: value" pairs, and the keys are not the same for each ID. I want to turn the keys into column headers and put each value, where present, in the matching cell.
ID all_dimensions
12 Height:2 cm,Volume: 4cl,Weight:100g
34 Length: 10cm, Height: 5 cm
56 Depth: 80cm
78 Weight: 2 kg, Length: 7 cm
90 Diameter: 4 cm, Volume: 50 cl
Desired output:
ID  Height  Volume  Weight  Length  Depth  Diameter
12  2 cm    4cl     100g    -       -      -
34  5 cm    -       -       10cm    -      -
56  -       -       -       -       80cm   -
78  -       -       2 kg    7 cm    -      -
90  -       50 cl   -       -       -      4 cm
I have over 100 dimensions, so ideally I would like to write a for loop or something similar rather than specify each column header by hand (see the code examples below). I am using Python 3.7.3 and pandas 0.24.2.
1) I have tried to split the data into separate columns, but wasn't sure how to proceed so that each value ends up under the right header:
df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)
2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):
df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')
3) To avoid naming each header as in 2), I created a list of all dimension attributes and tried to use it in a for loop to extract the values:
columns_list=df.all_dimensions.str.extract(r'^([\D]*):',expand=True).drop_duplicates()
columns_list=columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension=df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')
Here, Jupyter Notebook gives me a UserWarning: "Pandas doesn't allow columns to be created via a new attribute name", and the df looks the same as before.
Option 1: I prefer splitting several times:
# split first time into one "key:value" pair per row
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
             )
# split second time for individual measurement
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
         )
# stripping off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()
# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()
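To check Option 1 end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and wraps the same steps in a function. The function name widen_dimensions, the extra dropna() (a guard for pandas versions where stack() keeps missing values) and the final fillna('-') (to mimic the dashes in the desired output) are my own additions, not part of the steps above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})


def widen_dimensions(frame):
    """Hypothetical helper: one column per dimension name, indexed by ID."""
    # One row per "key:value" pair, still indexed by ID
    pairs = (frame.set_index("ID")["all_dimensions"]
                  .str.split(",", expand=True)
                  .stack()
                  .dropna()
                  .reset_index(level=-1, drop=True))
    # Split each pair once into key (column 0) and value (column 1)
    kv = pairs.str.split(":", n=1, expand=True).reset_index()
    kv[0] = kv[0].str.strip()
    kv[1] = kv[1].str.strip()
    # Pivot the keys into column headers
    return kv.set_index(["ID", 0])[1].unstack()


# Replace missing cells with "-" as in the desired output
wide = widen_dimensions(df).fillna("-")
print(wide)
```

Because the keys are discovered from the data, nothing changes if there are 100 different dimensions instead of six.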
Option 2: Use split(',|:'), as you tried:
# splitting
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',|:', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
             )
# concat along axis=1 to get dataframe with two columns
# new_df.columns = ('ID', 0, 1) where 0 is measurement name
# even positions hold the names, odd positions the values; strip both
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2].str.strip()), axis=1)
            .reset_index())
new_df.set_index(['ID', 0])[1].unstack()
Output:
    Depth  Diameter  Height  Length  Volume  Weight
ID
12    NaN       NaN    2 cm     NaN     4cl    100g
34    NaN       NaN    5 cm    10cm     NaN     NaN
56   80cm       NaN     NaN     NaN     NaN     NaN
78    NaN       NaN     NaN    7 cm     NaN    2 kg
90    NaN      4 cm     NaN     NaN   50 cl     NaN
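On the warning from your attempt 3): df.dimension = ... sets an attribute literally called "dimension" rather than a column named after the loop variable, and r'dimension([\w\s.]*)' searches for the literal text "dimension". If you would rather keep the loop, a rough sketch of how it could be repaired is below. It uses bracket assignment and builds the pattern from the loop variable. Collecting the names with str.findall, the variable names, and the exact regex are my assumptions, so adjust them to your real data.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [12, 34, 56, 78, 90],
    "all_dimensions": [
        "Height:2 cm,Volume: 4cl,Weight:100g",
        "Length: 10cm, Height: 5 cm",
        "Depth: 80cm",
        "Weight: 2 kg, Length: 7 cm",
        "Diameter: 4 cm, Volume: 50 cl",
    ],
})

# Collect every distinct dimension name (the text right before each colon)
names = set()
for keys in df["all_dimensions"].str.findall(r"([^,:]+):"):
    names.update(key.strip() for key in keys)

for dimension in sorted(names):
    # Build the pattern from the loop variable; re.escape guards special
    # characters, and (?:^|,) anchors the name to the start of a pair
    pattern = r"(?:^|,)\s*" + re.escape(dimension) + r"\s*:\s*([^,]*)"
    # Bracket assignment creates a real column;
    # df.dimension would only set an attribute on the object
    df[dimension] = (df["all_dimensions"]
                     .str.extract(pattern, expand=False)
                     .str.strip())

print(df.drop(columns="all_dimensions"))
```

This runs one regex pass per dimension, so with 100+ dimensions the split/unstack options are likely the cheaper choice; the loop is mainly useful if you want to stay close to your original code.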