I have a dataframe df with the columns ['metric_type', 'metric_value']. For each row, I want a column named after that row's 'metric_type' whose value is that row's 'metric_value'.
One of my problems is that 'metric_type' has spurious spaces that I want to get rid of.
Consider the dataframe df:
import pandas as pd

df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])
print(df)
  metric_type  metric_value
0          a              1
1           b             2
2          c              3
Notice that each value of 'metric_type' has spaces in varying places.
I wrote a function to use with apply, but it takes a horribly long time.
def assign_metric_vals(row):
    row[row['metric_type'].replace(" ", "")] = row['metric_value']
    return row
When I use it with df.apply(assign_metric_vals, 1), I get this:
       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3
Is there a better (read, "faster") way to accomplish this same task?
You're much better served by appending the cleaned metric_type to your index and unstacking.
df.set_index(df.metric_type.str.replace(' ', ''), append=True).metric_value.unstack()
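To make the one-liner easier to follow, here is a minimal sketch that splits it into steps, assuming the sample df from the question. The intermediate names (cleaned, indexed, wide) are only illustrative, and regex=False is my addition so the space is treated as a literal string.

```python
import pandas as pd

df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])

# Step 1: remove every space from the labels (literal match, no regex).
cleaned = df['metric_type'].str.replace(' ', '', regex=False)

# Step 2: append the cleaned labels as a second index level,
# keeping the original row number as the first level.
indexed = df.set_index(cleaned, append=True)['metric_value']

# Step 3: move the inner index level (the labels) into columns.
# Each row has a value only under its own label; the rest become NaN.
wide = indexed.unstack()
print(wide)
```

Because the original row number stays in the index, each (row, label) pair is unique, which is what unstack needs.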
demonstration
df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])
print(df)
  metric_type  metric_value
0          a              1
1           b             2
2          c              3
print(df.apply(assign_metric_vals, 1))
       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3
or my way
idx = df.metric_type.str.replace(' ', '')
d1 = df.set_index(idx, append=True).metric_value.unstack()
print(pd.concat([d1, df], axis=1))
       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3
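One thing to keep in mind: str.replace(' ', '') removes every space, including any inside a label. If only leading and trailing spaces are spurious, str.strip() may be the safer choice. A small sketch of the difference, where the label ' d e ' is a made-up example that is not in the question's data:

```python
import pandas as pd

# ' d e ' is a hypothetical label with an interior space.
labels = pd.Series(['a ', ' b', ' c ', ' d e '])

# strip() only trims whitespace at both ends.
stripped = labels.str.strip()

# replace() removes every space, including the interior one.
no_spaces = labels.str.replace(' ', '', regex=False)

print(stripped.tolist())
print(no_spaces.tolist())
```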
timing
use a bigger df
df1 = pd.concat([df] * 30000, ignore_index=True)
%%timeit
idx = df1.metric_type.str.replace(' ', '')
d1 = df1.set_index(idx, append=True).metric_value.unstack()
pd.concat([d1, df1], axis=1)
10 loops, best of 3: 77.3 ms per loop
%%timeit
df1.apply(assign_metric_vals, 1)
1 loop, best of 3: 57.4 s per loop
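%%timeit is an IPython magic, so it only works in a notebook or IPython shell. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The helper names with_unstack and with_apply are mine, the frame is kept small (300 copies rather than 30000) so the apply version finishes quickly, and the row.copy() is a defensive addition. Absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])

# Smaller than the 30000 copies used above, to keep the apply run short.
df1 = pd.concat([df] * 300, ignore_index=True)


def assign_metric_vals(row):
    # Work on a copy so the row handed in by apply is not modified.
    row = row.copy()
    row[row['metric_type'].replace(" ", "")] = row['metric_value']
    return row


def with_unstack():
    idx = df1['metric_type'].str.replace(' ', '', regex=False)
    d1 = df1.set_index(idx, append=True)['metric_value'].unstack()
    return pd.concat([d1, df1], axis=1)


def with_apply():
    return df1.apply(assign_metric_vals, axis=1)


# Best of three single runs for each approach.
t_unstack = min(timeit.repeat(with_unstack, number=1, repeat=3))
t_apply = min(timeit.repeat(with_apply, number=1, repeat=3))
print(f"unstack: {t_unstack:.4f} s, apply: {t_apply:.4f} s")
```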