I am a complete newbie to Python, and I'm stuck with a regex problem. I'm trying to remove the line break character at the end of each line in a text file, but only if it follows a lowercase letter, i.e. [a-z]. If a line ends in a lowercase letter, I want to replace the line break/newline character with a space.
This is what I've got so far:
import re
import sys
textout = open("output.txt","w")
textblock = open(sys.argv[1]).read()
textout.write(re.sub("[a-z]\z","[a-z] ", textblock, re.MULTILINE) )
textout.close()
In JavaScript, use the String.replace() method to remove all line breaks from a string, e.g. str.replace(/[\r\n]/gm, ''). The replace() method returns a new string in which every line break has been replaced with an empty string.
Line breaks: by default, the symbols "^" and "$" match the beginning and end of the whole input, not the beginning and end of each line. To match a line break explicitly when you construct your regex, use the sequence "\r\n" (or "\n" alone for Unix-style line endings).
In JavaScript, the trim() method removes any line breaks from the start and end of a string. It handles all line terminator characters (LF, CR, etc.) and also removes any leading or trailing spaces or tabs. trim() doesn't change the original string; it returns a new string.
\f matches a form-feed character. \n matches a newline character. \r matches a carriage return character.
Try
re.sub(r"(?<=[a-z])\r?\n"," ", textblock)
\Z only matches at the end of the string, after the last linebreak, so it's definitely not what you need here. \z is not recognized by the Python regex engine.
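To see why \Z cannot stand in for "end of each line", here is a small sketch comparing it with $ under re.MULTILINE. The sample string and the variable names are made up for illustration.

```python
import re

text = "abc\ndef\n"  # illustrative sample, 8 characters long

# \Z matches only once, at the very end of the string.
end_only = [m.start() for m in re.finditer(r"\Z", text)]

# With re.MULTILINE, $ matches before every newline and at the end of the string.
line_ends = [m.start() for m in re.finditer(r"$", text, flags=re.MULTILINE)]

print(end_only)   # [8]
print(line_ends)  # [3, 7, 8]
```

Neither anchor consumes the newline itself, which is why the replacement has to match the line break characters explicitly.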
(?<=[a-z]) is a positive lookbehind assertion that checks whether the character before the current position is a lowercase ASCII letter. Only then will the regex engine try to match a line break.
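A quick way to check the lookbehind is to run it on a short string that mixes both cases. The sample text below is invented for the demonstration; only the pattern comes from the answer.

```python
import re

# Illustrative input: two lines end in a lowercase letter, two do not.
sample = "wrapped\nline.\nwindows\r\nEnd 1\n"

# The lookbehind checks the preceding character without consuming it,
# so only the line break itself (\n or \r\n) is replaced by a space.
joined = re.sub(r"(?<=[a-z])\r?\n", " ", sample)

print(repr(joined))  # 'wrapped line.\nwindows End 1\n'
```

The breaks after "line." and "End 1" survive because the character before them is not in [a-z].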
Also, always use raw strings for regexes; it makes backslashes easier to handle.
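Putting it together, the script from the question could be reworked roughly as follows. This is only a sketch: the function name join_wrapped_lines and the constant SOFT_BREAK are hypothetical, and opening the files with newline="" is an added assumption so that "\r\n" endings reach the regex untranslated. Note also that the fourth positional parameter of re.sub is count, not flags, so the re.MULTILINE in the original call was being treated as a replacement limit; flags should be passed by keyword when they are needed, and this pattern needs none.

```python
import re

# Lookbehind for a lowercase ASCII letter, then an optional CR followed by LF.
SOFT_BREAK = re.compile(r"(?<=[a-z])\r?\n")


def join_wrapped_lines(in_path, out_path):
    """Copy in_path to out_path, replacing line breaks after [a-z] with a space."""
    # newline="" disables newline translation, so "\r\n" is seen as written.
    with open(in_path, newline="") as src:
        text = src.read()
    with open(out_path, "w", newline="") as dst:
        dst.write(SOFT_BREAK.sub(" ", text))
```

To mirror the original script, call join_wrapped_lines(sys.argv[1], "output.txt") after importing sys. The with blocks close both files even if an error occurs.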