I am parsing data from an Excel file that has extra white space in some of the column headings.
When I check the columns of the resulting dataframe with df.columns, I see:
Index(['Year', 'Month ', 'Value'])
                     ^
#                    Note the unwanted trailing space on 'Month '
Consequently, I can't do:
df["Month"]
That lookup fails with a column-not-found error, because I asked for "Month", not "Month ".
My question, then, is how can I strip out the unwanted white space from the column headings?
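For anyone who wants to reproduce the situation described above, here is a minimal sketch. The DataFrame is a hypothetical stand-in for the parsed Excel sheet (the real file is not shown in the question), and the list comprehension is just one possible way to spot padded headings.

```python
import pandas as pd

# Hypothetical stand-in for the parsed Excel data; note the trailing space in 'Month '.
df = pd.DataFrame({"Year": [1], "Month ": [2], "Value": [3]})

# tolist() shows the labels with quotes, which makes stray whitespace visible.
print(df.columns.tolist())

# Looking up the clean name fails because the stored label is 'Month '.
try:
    df["Month"]
    found = True
except KeyError:
    found = False

# Collect every string heading that differs from its stripped form.
padded = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
print(found, padded)
```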
Leading and trailing spaces in a pandas DataFrame column can be stripped with the str.strip() function.
To strip whitespace from column names, you can use str.strip, str.lstrip and str.rstrip.
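A short sketch of how the three variants differ when applied to column names through the .str accessor. The padded headings are made up for illustration.

```python
import pandas as pd

# Made-up headings with padding on different sides.
df = pd.DataFrame(columns=["  Year", "Month  ", "  Value  "])

both = df.columns.str.strip().tolist()    # removes leading and trailing whitespace
left = df.columns.str.lstrip().tolist()   # removes leading whitespace only
right = df.columns.str.rstrip().tolist()  # removes trailing whitespace only

print(both)
print(left)
print(right)
```

For headings read from a spreadsheet, strip() is usually the one wanted, since padding can appear on either side.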
You can use DataFrame.select_dtypes to select string columns and then apply the str.strip function to their values.
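Note that this cleans the cell values rather than the headings. A minimal sketch follows, assuming the text columns are stored with object dtype (set explicitly here so the selection behaves the same across pandas versions). The column names and data are invented for illustration.

```python
import pandas as pd

# Invented data; the text columns are created with object dtype on purpose.
df = pd.DataFrame({
    "name": pd.Series([" alice ", "bob  "], dtype=object),
    "city": pd.Series(["  Oslo", " Rome "], dtype=object),
    "age": [30, 41],
})

# Pick the object-dtype (text) columns, then strip each one's values.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip())

print(df)
```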
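To remove a suffix from column names in pandas, you can use the string rstrip() function or the string replace() function. Be aware that rstrip() removes any trailing characters from the given set rather than an exact suffix, so it can strip more than intended.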
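The sketch below illustrates the difference between stripping a character set and removing an exact suffix. The column names, the "_usd" suffix and the drop_suffix helper are all hypothetical, chosen only to make the over-stripping visible.

```python
import pandas as pd

# Hypothetical headings; "_usd" is the suffix to remove.
df = pd.DataFrame(columns=["price_usd", "discount_usd", "status"])
suffix = "_usd"

# rstrip treats its argument as a set of characters ('_', 'u', 's', 'd'),
# so it also eats the end of 'status'.
over_stripped = df.columns.str.rstrip(suffix).tolist()

def drop_suffix(name, suffix=suffix):
    """Hypothetical helper: remove the suffix only when the name ends with it."""
    return name[: -len(suffix)] if name.endswith(suffix) else name

clean = df.rename(columns=drop_suffix).columns.tolist()

print(over_stripped)
print(clean)
```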
You can give functions to the rename method. The str.strip() method should do what you want:
In [5]: df
Out[5]:
   Year  Month   Value
0     1       2      3

[1 rows x 3 columns]

In [6]: df.rename(columns=lambda x: x.strip())
Out[6]:
   Year  Month  Value
0     1      2      3

[1 rows x 3 columns]
Note that this returns a new DataFrame object, which is shown as output on screen, but the changes are not actually applied to your columns. To keep the changes, either use this in a method chain or re-assign the df variable:
df = df.rename(columns=lambda x: x.strip())
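For the method-chain option, a minimal sketch might look like the following. The data and the filtering step are invented for illustration; the point is only that rename returns a new DataFrame whose cleaned names are available to the later steps, while the original is left untouched.

```python
import pandas as pd

# Invented data with padded headings.
raw = pd.DataFrame({"Year": [2020, 2021], "Month ": [1, 2], " Value": [10, 20]})

result = (
    raw
    .rename(columns=lambda x: x.strip())  # cleaned names from here on
    .loc[lambda d: d["Month"] == 2]       # can now refer to 'Month'
    .reset_index(drop=True)
)

print(raw.columns.tolist())     # original headings are unchanged
print(result.columns.tolist())
```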
Since version 0.16.1 you can just call .str.strip on the columns:
df.columns = df.columns.str.strip()
Here is a small example:
In [5]:
df = pd.DataFrame(columns=['Year', 'Month ', 'Value'])
print(df.columns.tolist())
df.columns = df.columns.str.strip()
df.columns.tolist()

['Year', 'Month ', 'Value']
Out[5]:
['Year', 'Month', 'Value']
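One caveat worth sketching: spreadsheets sometimes produce headings that are not strings (a numeric year, for instance). Such labels would not come through .str.strip() unchanged; depending on the pandas version and the mix of types, they may turn into NaN or raise an error. A possible workaround, assuming you want to leave non-string labels alone, is a guarded rename. The headings below are made up.

```python
import pandas as pd

# Made-up headings that mix strings with an integer label.
df = pd.DataFrame(columns=["Year", "Month ", 2020, " Value"])

# Strip only the labels that are strings; pass everything else through untouched.
df = df.rename(columns=lambda x: x.strip() if isinstance(x, str) else x)

print(df.columns.tolist())
```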
Timings
In[26]:
df = pd.DataFrame(columns=[' year', ' month ', ' day', ' asdas ', ' asdas', 'as ', ' sa', ' asdas '])
df
Out[26]:
Empty DataFrame
Columns: [ year, month , day, asdas , asdas, as , sa, asdas ]

%timeit df.rename(columns=lambda x: x.strip())
%timeit df.columns.str.strip()
1000 loops, best of 3: 293 µs per loop
10000 loops, best of 3: 143 µs per loop
So str.strip is ~2X faster, and I expect this to scale better for larger dfs.
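The %timeit magic only works inside IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard-library timeit module. The repetition count is an arbitrary choice, and the absolute numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

# Same padded headings as in the timing session above.
cols = [' year', ' month ', ' day', ' asdas ', ' asdas', 'as ', ' sa', ' asdas ']
df = pd.DataFrame(columns=cols)

n = 200  # arbitrary number of repetitions

t_rename = timeit.timeit(lambda: df.rename(columns=lambda x: x.strip()), number=n)
t_str = timeit.timeit(lambda: df.columns.str.strip(), number=n)

print(f"rename:    {t_rename / n * 1e6:.1f} µs per call")
print(f"str.strip: {t_str / n * 1e6:.1f} µs per call")
```

Keep in mind that the two calls do slightly different work: rename builds a whole new DataFrame, while .str.strip only builds a new Index that still has to be assigned back to df.columns.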