I have a string that looks like this:
"aaa\n\t\n asd123asd water's tap413 water blooe's"
How can I remove all escape characters, numbers, and punctuation except apostrophe using regex?
I'm pretty new to regex, and would appreciate it if you can explain what each expression means, especially if the regex turns out to be complicated.
You're looking for a search-and-replace method, which in Python is re.sub(). Simply replace every run of characters that is not a letter, an apostrophe, or a space ([^a-zA-Z' ]+) with '' (nothing).
- Oh well, what about the escaped characters?
R: They turn into a single character once they are inside the string. For example, \n becomes a newline character, which is neither a letter nor a ', so the character class already removes it.
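To make that point concrete, here is a minimal sketch (the variable names are only illustrative) showing that an escape sequence in a normal Python string literal is one character, and that the character class from above removes it like any other non-letter:

```python
import re

# Once Python has parsed the literal, "\n" and "\t" are single characters
newline_length = len("\n")
print(newline_length)

# Shortened version of the string from the question
s = "aaa\n\t\n asd123asd"

# Newlines, tabs and digits are not letters, apostrophes or spaces,
# so the negated character class matches and removes them
cleaned = re.sub("[^a-zA-Z' ]+", "", s)
print(cleaned)
```

The space after the last \n survives because a space is listed inside the brackets.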
If, instead, your string actually contains an escaped backslash (like "abc\\nefg"), you should add \\\\.| at the start of your regex. It matches the backslash plus the one character that follows it, so the full pattern becomes \\\\.|[^a-zA-Z' ]
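As a rough sketch of that second case, assuming the string really contains a literal backslash followed by the letter n (the input below is the hypothetical one from the sentence above, and the variable names are made up for illustration):

```python
import re

# "abc\\nefg" holds 8 characters: a, b, c, a real backslash, n, e, f, g
s = "abc\\nefg"

# Without the extra alternative only the backslash is removed,
# and the "n" that followed it is left behind
without_prefix = re.sub("[^a-zA-Z' ]", "", s)
print(without_prefix)

# With \\\\.| in front, the backslash and the character after it go together
with_prefix = re.sub("\\\\.|[^a-zA-Z' ]", "", s)
print(with_prefix)

# A raw string spells the same pattern with two backslashes instead of four
raw_pattern = r"\\.|[^a-zA-Z' ]"
same_pattern = raw_pattern == "\\\\.|[^a-zA-Z' ]"
print(same_pattern)
```

Using a raw string (the r prefix) is optional here; it just avoids doubling every backslash for the Python literal.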
Here is a working example:
import re
s = "aaa\n\t\n asd123asd water's tap413 water blooe's"
replaced = re.sub("[^a-zA-Z' ]+", '', s)
print(replaced)
https://repl.it/repls/ReasonableUtterAnglerfish
Would appreciate it if you can explain what each expression means
So, the explanation:
\\\\ - Matches a backslash. (Why four? Each pair escapes a backslash in the Python string literal, so the regex engine receives \\ which is how you match a single backslash in regex.)
. - Matches any character except the newline character.
| - OR expression, matches what is before it OR what is after it.
[^...] - Must NOT be one of the characters listed inside.
a-zA-Z' - Matches characters from a to z, A to Z, ' or a space (note the space before the closing bracket).
+ - Quantifier, meaning "one or more occurrences of the preceding term". It is not needed here, but it reduces the number of matches and therefore the execution time.
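The remark about the + quantifier can be checked with re.subn, which returns the new string together with the number of substitutions made. This is only a small sketch using the string from the question; it does not measure actual timings:

```python
import re

s = "aaa\n\t\n asd123asd water's tap413 water blooe's"

# With +, each run of unwanted characters is replaced in one substitution
with_plus, count_with_plus = re.subn("[^a-zA-Z' ]+", "", s)

# Without +, every unwanted character is replaced on its own
without_plus, count_without_plus = re.subn("[^a-zA-Z' ]", "", s)

print(with_plus)
print(count_with_plus, count_without_plus)
```

Both calls produce the same cleaned string; only the number of individual replacements differs.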