Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing newlines with spaces for str columns through pandas dataframe

Given an example dataframe with the 2nd and 3rd columns of free text, e.g.

>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]
>>> pd.DataFrame(lol)
   0  1          2          3
0  1  2        abc   foo\nbar
1  3  1  def\nhaha  love it\n

The goal is to replace the \n to (whitespace) and strip the string in column 2 and 3 to achieve:

>>> pd.DataFrame(lol)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it

How to replace newlines with spaces for specific columns through pandas dataframe?

I have tried this:

>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]

>>> replace_and_strip = lambda x: x.replace('\n', ' ').strip()

>>> lol2 = [[replace_and_strip(col) if type(col) == str else col for col in list(row)] for idx, row in pd.DataFrame(lol).iterrows()]

>>> pd.DataFrame(lol2)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it

But there must be a better/simpler way.

like image 936
alvas Avatar asked Oct 02 '17 09:10

alvas


People also ask

How replace column values with conditions in pandas?

You can replace all values or selected values in a column of pandas DataFrame based on condition by using DataFrame. loc[] , np. where() and DataFrame. mask() methods.

Can pandas column names have spaces?

You can refer to column names that contain spaces or operators by surrounding them in backticks. This way you can also escape names that start with a digit, or those that are a Python keyword. Basically when it is not valid Python identifier. See notes down for more details.

How do you remove leading and trailing spaces in pandas?

strip() function is used to remove or strip the leading and trailing space of the column in pandas dataframe.


3 Answers

Use replace - first first and last strip and then replace \n:

df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n',  ' ', regex=True)
print (df)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
like image 161
jezrael Avatar answered Nov 02 '22 23:11

jezrael


You can select_dtypes to select columns of type object and use applymap on those columns.

Because there is no inplace argument for these functions, this would be a workaround to make change to the dataframe:

strs = lol.select_dtypes(include=['object']).applymap(lambda x: x.replace('\n', ' ').strip())
lol[strs.columns] = strs
lol
#   0  1         2        3
#0  1  2       abc  foo bar
#1  3  1  def haha  love it
like image 42
zipa Avatar answered Nov 03 '22 01:11

zipa


Adding to the other nice answers, this is a vectorized version of your initial idea:

columns = [2,3] 
df.iloc[:, columns] = [df.iloc[:,col].str.strip().str.replace('\n',' ') 
                       for col in columns] 

Details:

In [49]: df.iloc[:, columns] = [df.iloc[:,col].str.strip().str.replace('\n',' ') 
                                 for col in columns]  

In [50]: df
Out[50]: 
   0  1        2         3
0  1  2      abc  def haha
1  3  1  foo bar   love it
like image 45
Mohamed Ali JAMAOUI Avatar answered Nov 03 '22 00:11

Mohamed Ali JAMAOUI