How can you make numerous modifications to DataFrame column names while avoiding boilerplate code?
Reproducible example:
import pandas as pd

data = {'Subject Id': ['1', '2', '3'],
        'First-Name': ['Alex', 'Amy', 'Allen'],
        'Last, name': ['Anderson', 'Ackerman', 'Ali']}
df = pd.DataFrame(data, columns=['Subject Id', 'First-Name', 'Last, name'])
df
Subject Id First-Name Last, name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
To clean the column names I'd usually do something like this:
df.columns = [l.lower() for l in df.columns]
df.columns = [s.replace('-', ' ') for s in df.columns]
df.columns = [d.replace(',', ' ') for d in df.columns]
But sometimes I need to make far more than 3 modifications. Is there a way to chain such operations together or otherwise do this more efficiently?
You can call the vectorised .str methods and chain these calls on your columns; here we use str.lower and str.replace:
In [91]:
df.columns = df.columns.str.lower().str.replace('-|,', ' ', regex=True)
df
Out[91]:
subject id first name last name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
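When more than a couple of modifications are needed, the same chain can simply be extended and wrapped in parentheses for readability. The following is a minimal sketch rather than a definitive recipe: the extra whitespace-collapsing and strip steps are assumptions about what further cleaning might be wanted, not part of the original question. It passes regex=True explicitly, because pandas 2.0 changed the default of str.replace to literal matching. It also collapses repeated whitespace, since replacing the comma in 'Last, name' with a space otherwise leaves two consecutive spaces.

```python
import pandas as pd

df = pd.DataFrame({'Subject Id': ['1', '2', '3'],
                   'First-Name': ['Alex', 'Amy', 'Allen'],
                   'Last, name': ['Anderson', 'Ackerman', 'Ali']})

# Each step returns a new Index, so the calls chain naturally.
df.columns = (
    df.columns
    .str.lower()
    .str.replace(r'[-,]', ' ', regex=True)  # hyphens and commas become spaces
    .str.replace(r'\s+', ' ', regex=True)   # assumed extra step: collapse repeated whitespace
    .str.strip()                            # assumed extra step: trim leading/trailing whitespace
)
print(list(df.columns))
```

Adding another modification is then a matter of adding one more line to the chain.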
Note also there was nothing stopping you from just combining everything in a single list comprehension:
In [93]:
df.columns = [l.lower().replace('-', ' ').replace(',',' ') for l in df.columns]
df
Out[93]:
subject id first name last name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
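If the list of modifications keeps growing, one option is to move the logic into a small function and pass it to DataFrame.rename, which accepts a callable for columns. This is only a sketch under stated assumptions: clean_name and REPLACEMENTS are hypothetical names introduced here for illustration, and the substitutions listed are examples.

```python
import pandas as pd

# Hypothetical list of (old, new) substitutions, applied in order; extend as needed.
REPLACEMENTS = [('-', ' '), (',', ' '), ('  ', ' ')]

def clean_name(name):
    """Lower-case a column name and apply each substitution in turn."""
    name = name.lower()
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name.strip()

df = pd.DataFrame({'Subject Id': ['1', '2', '3'],
                   'First-Name': ['Alex', 'Amy', 'Allen'],
                   'Last, name': ['Anderson', 'Ackerman', 'Ali']})

# rename calls clean_name once per column label and returns a new DataFrame.
df = df.rename(columns=clean_name)
print(list(df.columns))
```

Keeping the substitutions in one list means a new rule is a one-line change, and the same function can be reused across DataFrames.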
A list comprehension may be quicker on such a small number of columns.
Timings:
In [96]:
%timeit [l.lower().replace('-', ' ').replace(',',' ') for l in df.columns]
%timeit df.columns.str.lower().str.replace('-|,', ' ', regex=True)
100000 loops, best of 3: 5.26 µs per loop
1000 loops, best of 3: 284 µs per loop