Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python regex to replace all windows newlines with spaces

Tags:

python

regex

I did this:

from urllib import urlopen
import nltk
url = http://myurl.com
html = urlopen(url).read()
cleanhtml = nltk.clean_html(html)

I now have a long string in python which is full of text interrupted periodically by windows newlines /r/n, and I simply want to remove all of the occurrences of /r/n from the string using a regular expression. First I want to replace it with a space. As such, I did this:

import re
textspaced = re.sub("'\r\n'", r"' '", cleanhtml)

...it didn't work. So what am I doing wrong?

like image 956
magnetar Avatar asked Jun 29 '11 16:06

magnetar


People also ask

How do you remove all newlines in Python?

replace('\n','') is the correct method to remove all carriage returns.

How do you match line breaks in RegEx?

Line breaks In pattern matching, the symbols “^” and “$” match the beginning and end of the full file, not the beginning and end of a line. If you want to indicate a line break when you construct your RegEx, use the sequence “\r\n”.

How do you remove paragraph breaks in Python?

Use the strip() Function to Remove a Newline Character From the String in Python. The strip() function is used to remove both trailing and leading newlines from the string that it is being operated on. It also removes the whitespaces on both sides of the string.


2 Answers

There's no need to use regular expressions, just

htmlspaced = html.replace('\r\n', ' ')

If you need to also match UNIX and oldMac newlines, use regular expressions:

import re
htmlspaces = re.sub(r'\r\n|\r|\n', ' ', html)
like image 151
phihag Avatar answered Oct 05 '22 07:10

phihag


Just a small syntax error:

htmlspaced = re.sub(r"\r\n", " ", html)

should work.

like image 33
Tim Pietzcker Avatar answered Oct 05 '22 08:10

Tim Pietzcker