I did this:
from urllib import urlopen
import nltk
url = "http://myurl.com"
html = urlopen(url).read()
cleanhtml = nltk.clean_html(html)
I now have a long string in Python that is full of text interrupted periodically by Windows newlines (\r\n), and I simply want to remove all occurrences of \r\n from the string using a regular expression. First I want to replace each one with a space. As such, I did this:
import re
textspaced = re.sub("'\r\n'", r"' '", cleanhtml)
...it didn't work. So what am I doing wrong?
replace('\n', '') removes every line feed (\n) from the string; it does not remove carriage returns (\r). To remove Windows line endings, replace '\r\n' instead.
Line breaks: in pattern matching, the symbols “^” and “$” by default match the beginning and end of the whole input, not the beginning and end of each line (in Python, pass re.MULTILINE to make them match at line boundaries). If you want to match a Windows line break when you construct your regex, use the sequence “\r\n”.
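As a rough illustration of the anchoring behavior in Python's re module, here is a minimal sketch. The sample string is made up for the example and is not from the original post.

```python
import re

# Hypothetical sample text with Windows line endings.
sample = "first line\r\nsecond line\r\nthird line"

# Without flags, ^ anchors only at the very start of the string.
starts_default = re.findall(r"^\w+", sample)

# With re.MULTILINE, ^ also anchors immediately after each "\n".
starts_multiline = re.findall(r"^\w+", sample, flags=re.MULTILINE)

# "\r\n" in a pattern matches the Windows line break itself.
breaks = re.findall(r"\r\n", sample)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second', 'third']
print(len(breaks))       # 2
```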
Use the strip() method to remove newline characters from the ends of a string in Python. strip() removes leading and trailing newlines, along with any other whitespace on both sides of the string; it does not touch newlines in the middle of the string.
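To make the difference concrete, the sketch below shows that strip() only trims the ends, so interior line breaks still need replace(). The sample string is an assumption chosen for illustration.

```python
# Hypothetical sample text: whitespace and line breaks at both ends,
# plus one Windows line break in the middle.
text = "\r\n  first line\r\nsecond line  \r\n"

# strip() trims whitespace (including \r and \n) from both ends only.
stripped = text.strip()

# The interior "\r\n" survives strip(); replace() handles it.
spaced = stripped.replace("\r\n", " ")

print(repr(stripped))  # 'first line\r\nsecond line'
print(repr(spaced))    # 'first line second line'
```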
There's no need to use regular expressions; just use str.replace:
htmlspaced = html.replace('\r\n', ' ')
If you also need to match UNIX (\n) and old Mac (\r) newlines, use a regular expression:
import re
htmlspaced = re.sub(r'\r\n|\r|\n', ' ', html)
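A small comparison of the two approaches on text with mixed line endings may help. This is only a sketch: the sample string is invented, and the splitlines() variant is an alternative not mentioned in the answer above.

```python
import re

# Hypothetical sample mixing Windows, UNIX, and old Mac line endings.
mixed = "windows\r\nunix\nold mac\rend"

# str.replace only touches the exact "\r\n" sequence.
only_crlf = mixed.replace("\r\n", " ")

# The alternation tries "\r\n" first, so a Windows line break becomes
# one space rather than two.
all_newlines = re.sub(r"\r\n|\r|\n", " ", mixed)

# Alternative without regex: splitlines() recognizes all three endings
# (and a few rarer separators), then the pieces are rejoined.
joined = " ".join(mixed.splitlines())

print(repr(only_crlf))     # 'windows unix\nold mac\rend'
print(repr(all_newlines))  # 'windows unix old mac end'
print(repr(joined))        # 'windows unix old mac end'
```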
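Just a small mistake: the extra quotes inside the pattern and replacement strings are treated as literal characters, so the pattern never matches. Without them,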
htmlspaced = re.sub(r"\r\n", " ", html)
should work.
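To see why the original call left the string unchanged, here is a short sketch. The value of cleanhtml is a made-up stand-in, since the real page content is not shown in the question.

```python
import re

# Hypothetical stand-in for the cleaned page text.
cleanhtml = "some text\r\nmore text"

# Original attempt: the pattern is the four characters ' CR LF ',
# so it only matches a line break wrapped in literal single quotes.
unchanged = re.sub("'\r\n'", r"' '", cleanhtml)

# Corrected call: the pattern is just CR LF, the replacement one space.
fixed = re.sub(r"\r\n", " ", cleanhtml)

print(unchanged == cleanhtml)  # True, nothing was replaced
print(repr(fixed))             # 'some text more text'
```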