normalizing data by duplication

Tags:

note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so with all respect due, I chose not to delete the thread

I have a 'dataset' with the following format:

     id | value | ...
--------|-------|------
      a | 156   | ...
    b,c | 457   | ...
e,g,f,h | 346   | ...
    ... | ...   | ...

and I would like to normalize it by duplicating all values for each ids:

     id | value | ...
--------|-------|------
      a | 156   | ...
      b | 457   | ...
      c | 457   | ...
      e | 346   | ...
      g | 346   | ...
      f | 346   | ...
      h | 346   | ...
    ... | ...   | ...

What I'm doing is applying the split-apply-combine principle of pandas using .groupby that creates a tuple for each group (groupby value, pd.DataFrame())

I created a column to group by that simply counts the ids in the row:

df['count_ids'] = df['id'].str.split(',').apply(lambda x: len(x))

     id | value | count_ids
--------|-------|------
      a | 156   | 1
    b,c | 457   | 2
e,g,f,h | 346   | 4
    ... | ...   | ...

The way I'm duplicating the rows is as follows:

pd.DataFrame().append([group]*count_ids)

I'm slowly progressing, but it is really complex, and I would appreciate any best practice or recommendation you can share with this type of problems.

602

asked Aug 22 '16 11:08

Lorinc Nyitrai

1 Answers

try this:

In [44]: df
Out[44]:
        id  value
0        a    156
1      b,c    457
2  e,g,f,h    346

In [45]: (df['id'].str.split(',', expand=True)
   ....:          .stack()
   ....:          .reset_index(level=0)
   ....:          .set_index('level_0')
   ....:          .rename(columns={0:'id'})
   ....:          .join(df.drop('id',1), how='left')
   ....: )
Out[45]:
  id  value
0  a    156
1  b    457
1  c    457
2  e    346
2  g    346
2  f    346
2  h    346

Explanation:

In [48]: df['id'].str.split(',', expand=True).stack()
Out[48]:
0  0    a
1  0    b
   1    c
2  0    e
   1    g
   2    f
   3    h
dtype: object

In [49]: df['id'].str.split(',', expand=True).stack().reset_index(level=0)
Out[49]:
   level_0  0
0        0  a
0        1  b
1        1  c
0        2  e
1        2  g
2        2  f
3        2  h

In [50]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0')
Out[50]:
         0
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [51]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'id'})
Out[51]:
        id
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [52]: df.drop('id',1)
Out[52]:
   value
0    156
1    457
2    346

168

answered Oct 05 '22 10:10

MaxU - stop WAR against UA

Related questions
                            
                                When/How does an anonymous file object close?
                            
                                Split a pandas column of dictionaries into multiple columns
                            
                                Returning a PDF from S3 in Flask
                            
                                Middleware in flask
                            
                                Vectorizing a Nested Loop
                            
                                Restricted set operations on python dictionary key views
                            
                                Formatted string literals in Python 3.6 with tuples
                            
                                Pandas Divide dataframe by index values
                            
                                pandas merge dataframes on closest timestamp
                            
                                How can I remove all non-alphanumeric characters from a string, except for '#', with regex?
                            
                                How many iterations a needed to train tensorflow with the entire MNIST data set (60000 images)?
                            
                                Changing numpy structured array dtype names and formats
                            
                                unordered_map<int, vector<float>> equivalent in Python
                            
                                Unicode in the standard TensorFlow format
                            
                                python ) google spread sheet : update api does not work with 403
                            
                                Possible to modify/prune learned trees in scikit-learn?
                            
                                Bokeh: Interact with legend label text
                            
                                Ruby's ERB like feature in python
                            
                                Cannot compare type 'Timestamp' with type 'int'
                            
                                PEP 8 E211 issue raised. Not sure why

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

normalizing data by duplication

Tags:

python

pandas

split-apply-combine

Lorinc Nyitrai

People also ask

1 Answers

MaxU - stop WAR against UA

Recent Activity

Donate For Us