More efficient way to clean a column of strings and add a new column

I have a dataframe df with the columns ['metric_type', 'metric_value']. For each row, I want a column named after that row's 'metric_type' value, holding that row's 'metric_value'.

One of my problems is that 'metric_type' has spurious spaces that I want to get rid of.

Consider the dataframe df:

import pandas as pd

df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])

print(df)

  metric_type  metric_value
0          a              1
1           b             2
2          c              3

Notice that each value of 'metric_type' has spaces in varying places.

I wrote a function to pass to apply, but it is horribly slow.

def assign_metric_vals(row):
    row[row['metric_type'].replace(" ", "")] = row['metric_value']
    return row

When I use it, I get this:

       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3

Is there a better (read, "faster") way to accomplish this same task?

asked Dec 21 '16 18:12 by user3002486


1 Answer

You're much better served by appending the cleaned metric_type to the index and then unstacking that level into columns.

df.set_index(df.metric_type.str.replace(' ', ''), append=True).metric_value.unstack()
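To make it easier to see what the one-liner does, here is a rough step-by-step sketch of the same chain. The intermediate names (clean, stacked, wide) are only for illustration, and regex=False is an assumption that the space should be treated as a literal character.

```python
import pandas as pd

df = pd.DataFrame(
    [['a ', 1], [' b', 2], [' c ', 3]],
    columns=['metric_type', 'metric_value'],
)

# Step 1: remove every space from the labels with a vectorised string method.
clean = df['metric_type'].str.replace(' ', '', regex=False)

# Step 2: append the cleaned labels as a second index level, next to the
# original row index, and keep only the values column.
stacked = df.set_index(clean, append=True)['metric_value']

# Step 3: move the inner index level into the columns. Cells with no
# matching (row, label) pair become NaN.
wide = stacked.unstack()
print(wide)
```

Each original row keeps its own index label, so every row ends up with exactly one non-missing cell.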

demonstration

df = pd.DataFrame([
        ['a ', 1],
        [' b', 2],
        [' c ', 3]
    ], columns=['metric_type', 'metric_value'])

print(df)

  metric_type  metric_value
0          a              1
1           b             2
2          c              3

print(df.apply(assign_metric_vals, 1))

       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3

or my way

idx = df.metric_type.str.replace(' ', '')
d1 = df.set_index(idx, append=True).metric_value.unstack()
print(pd.concat([d1, df], axis=1))

       a    b    c metric_type  metric_value
0 1.0000  nan  nan          a              1
1    nan 2.00  nan           b             2
2    nan  nan 3.00          c              3
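One thing to be aware of: str.replace(' ', '') removes every space, including any inside a label. If only leading and trailing whitespace should go, str.strip() may be the better fit. A small sketch of the difference, where ' d e ' is a made-up label that is not in the question's data:

```python
import pandas as pd

# ' d e ' is a hypothetical label with an inner space, added for illustration.
labels = pd.Series(['a ', ' b', ' c ', ' d e '])

# Removes all spaces, wherever they occur.
no_spaces = labels.str.replace(' ', '', regex=False)

# Removes leading and trailing whitespace only.
trimmed = labels.str.strip()

print(no_spaces.tolist())  # ['a', 'b', 'c', 'de']
print(trimmed.tolist())    # ['a', 'b', 'c', 'd e']
```

For the data in the question both give the same labels, so either can be used to build idx.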

timing

Use a bigger df:
df1 = pd.concat([df] * 30000, ignore_index=True)

%%timeit
idx = df1.metric_type.str.replace(' ', '')
d1 = df1.set_index(idx, append=True).metric_value.unstack()
pd.concat([d1, df1], axis=1)

10 loops, best of 3: 77.3 ms per loop

%%timeit
df1.apply(assign_metric_vals, 1)

1 loop, best of 3: 57.4 s per loop
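%%timeit is an IPython cell magic, so the cells above only run in IPython or Jupyter. Below is a rough sketch of the same comparison as a plain script using the standard-library timeit module. It assumes a much smaller frame (300 copies instead of 30000) so the apply version finishes quickly. The names small, with_apply and with_unstack are mine, and the numbers you get will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [['a ', 1], [' b', 2], [' c ', 3]],
    columns=['metric_type', 'metric_value'],
)

# Assumption: 300 copies instead of 30000, to keep the slow path short.
small = pd.concat([df] * 300, ignore_index=True)


def assign_metric_vals(row):
    # Row-wise version from the question.
    row[row['metric_type'].replace(" ", "")] = row['metric_value']
    return row


def with_apply():
    return small.apply(assign_metric_vals, axis=1)


def with_unstack():
    idx = small['metric_type'].str.replace(' ', '', regex=False)
    d1 = small.set_index(idx, append=True)['metric_value'].unstack()
    return pd.concat([d1, small], axis=1)


# Run each version once and report wall-clock time in seconds.
t_apply = timeit.timeit(with_apply, number=1)
t_unstack = timeit.timeit(with_unstack, number=1)
print(f"apply:   {t_apply:.4f} s")
print(f"unstack: {t_unstack:.4f} s")
```

A single run is noisy; raising number or using timeit.repeat would give steadier figures.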

answered Sep 29 '22 23:09 by piRSquared