Given an example dataframe where columns 2 and 3 contain free text, e.g.
>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]
>>> pd.DataFrame(lol)
   0  1          2          3
0  1  2        abc   foo\nbar
1  3  1  def\nhaha  love it\n
The goal is to replace each \n with a space and strip the strings in columns 2 and 3 to achieve:
>>> pd.DataFrame(lol)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
How can I replace newlines with spaces in specific columns of a pandas dataframe?
I have tried this:
>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]
>>> replace_and_strip = lambda x: x.replace('\n', ' ').strip()
>>> lol2 = [[replace_and_strip(col) if type(col) == str else col for col in list(row)] for idx, row in pd.DataFrame(lol).iterrows()]
>>> pd.DataFrame(lol2)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
But there must be a better/simpler way.
You can replace all values, or only the values that meet a condition, in a column of a pandas DataFrame by using DataFrame.loc[], np.where() or DataFrame.mask().
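As a rough illustration of those three options, here is a minimal sketch. The column names `score`, `label`, `passed` and `capped` are made up for this example and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data, only for illustrating conditional replacement
df = pd.DataFrame({"score": [10, 55, 80], "label": ["low", "mid", "high"]})

# .loc: assign only to the rows where the condition holds
df.loc[df["score"] < 50, "label"] = "fail"

# np.where: choose between two values element-wise
df["passed"] = np.where(df["score"] >= 50, "yes", "no")

# mask (shown here on a Series): replace values where the condition is True
df["capped"] = df["score"].mask(df["score"] > 60, 60)

print(df)
```

All three leave the rows that do not meet the condition unchanged, which is what makes them suitable for selective replacement.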
In DataFrame.query() and DataFrame.eval() expressions, you can refer to column names that contain spaces or operators by surrounding them in backticks. The same works for names that start with a digit or that are Python keywords, that is, any name that is not a valid Python identifier.
The str.strip() method removes leading and trailing whitespace (including newlines) from each string in a pandas column.
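A small sketch of what that looks like on a Series. The sample values are invented for illustration; note that whitespace in the middle of a string is left alone.

```python
import pandas as pd

# Hypothetical sample strings with leading/trailing whitespace
s = pd.Series(["  abc ", "love it\n", "\tdef", "foo\nbar"])

# .str.strip() trims both ends of every string; the inner "\n" survives
stripped = s.str.strip()
print(stripped.tolist())
# ['abc', 'love it', 'def', 'foo\nbar']
```

This is why stripping alone is not enough for the question's data: the inner newline in 'foo\nbar' still has to be replaced separately.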
Use replace: first strip leading and trailing whitespace, then replace \n:
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n', ' ', regex=True)
print(df)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
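The call above runs over the whole dataframe. Since the question asks about specific columns, one possible variation is to apply the same replacements to a column subset only. This is a sketch under the assumption that the columns to clean are the ones labelled 2 and 3; `cols` is a name introduced here for illustration.

```python
import pandas as pd

lol = [[1, 2, 'abc', 'foo\nbar'], [3, 1, 'def\nhaha', 'love it\n']]
df = pd.DataFrame(lol)

cols = [2, 3]  # assumed labels of the free-text columns

# Strip leading/trailing whitespace first, then turn inner newlines into spaces
df[cols] = (
    df[cols]
    .replace({r'^\s+': '', r'\s+$': ''}, regex=True)
    .replace(r'\n', ' ', regex=True)
)
print(df)
```

Columns outside `cols` are never touched, which avoids running the regexes on columns that should keep their newlines.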
You can use select_dtypes to select the columns of type object and then use applymap on those columns. Because these functions have no inplace argument, assigning the result back is a workaround to change the dataframe. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
df = pd.DataFrame(lol)  # lol itself is a plain list, so build the dataframe first
strs = df.select_dtypes(include=['object']).applymap(lambda x: x.replace('\n', ' ').strip())
df[strs.columns] = strs
df
#    0  1         2        3
# 0  1  2       abc  foo bar
# 1  3  1  def haha  love it
Adding to the other nice answers, this is a vectorized version of your initial idea:
columns = [2,3]
# apply runs once per column, so each cleaned column stays in its own position
df[columns] = df[columns].apply(lambda s: s.str.strip().str.replace('\n', ' '))
Details:
In [49]: df[columns] = df[columns].apply(lambda s: s.str.strip().str.replace('\n', ' '))
In [50]: df
Out[50]:
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it