note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so with all respect due, I chose not to delete the thread
I have a 'dataset' with the following format:
id | value | ...
--------|-------|------
a | 156 | ...
b,c | 457 | ...
e,g,f,h | 346 | ...
... | ... | ...
and I would like to normalize it by duplicating all values for each ids:
id | value | ...
--------|-------|------
a | 156 | ...
b | 457 | ...
c | 457 | ...
e | 346 | ...
g | 346 | ...
f | 346 | ...
h | 346 | ...
... | ... | ...
What I'm doing is applying the split-apply-combine principle of pandas
using .groupby
that creates a tuple
for each group (groupby value, pd.DataFrame())
I created a column to group by that simply counts the ids in the row:
df['count_ids'] = df['id'].str.split(',').apply(lambda x: len(x))
id | value | count_ids
--------|-------|------
a | 156 | 1
b,c | 457 | 2
e,g,f,h | 346 | 4
... | ... | ...
The way I'm duplicating the rows is as follows:
pd.DataFrame().append([group]*count_ids)
I'm slowly progressing, but it is really complex, and I would appreciate any best practice or recommendation you can share with this type of problems.
Basically, normalization is the process of efficiently organising data in a database. There are two main objectives of the normalization process: eliminate redundant data (storing the same data in more than one table) and ensure data dependencies make sense (only storing related data in a table).
Eliminate repeating groups in individual tables. Create a separate table for each set of related data. Identify each set of related data with a primary key.
try this:
In [44]: df
Out[44]:
id value
0 a 156
1 b,c 457
2 e,g,f,h 346
In [45]: (df['id'].str.split(',', expand=True)
....: .stack()
....: .reset_index(level=0)
....: .set_index('level_0')
....: .rename(columns={0:'id'})
....: .join(df.drop('id',1), how='left')
....: )
Out[45]:
id value
0 a 156
1 b 457
1 c 457
2 e 346
2 g 346
2 f 346
2 h 346
Explanation:
In [48]: df['id'].str.split(',', expand=True).stack()
Out[48]:
0 0 a
1 0 b
1 c
2 0 e
1 g
2 f
3 h
dtype: object
In [49]: df['id'].str.split(',', expand=True).stack().reset_index(level=0)
Out[49]:
level_0 0
0 0 a
0 1 b
1 1 c
0 2 e
1 2 g
2 2 f
3 2 h
In [50]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0')
Out[50]:
0
level_0
0 a
1 b
1 c
2 e
2 g
2 f
2 h
In [51]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'id'})
Out[51]:
id
level_0
0 a
1 b
1 c
2 e
2 g
2 f
2 h
In [52]: df.drop('id',1)
Out[52]:
value
0 156
1 457
2 346
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With