I have a huge corpus of text (line by line) and I want to remove special characters but sustain the space and structure of the string.
hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.
should be
hello there A Z R T world welcome to python
this should be the next line followed by another million like this
Remove Special Characters From the String in Python Using the str. isalnum() Method. The str. isalnum() method returns True if the characters are alphanumeric characters, meaning no special characters in the string.
Python Remove Character from String using translate() Python string translate() function replace each character in the string using the given translation table. We have to specify the Unicode code point for the character and 'None' as a replacement to remove it from the result string.
To remove multiple characters from a string we can easily use the function str. replace and pass a parameter multiple characters. The String class (Str) provides a method to replace(old_str, new_str) to replace the sub-strings in a string. It replaces all the elements of the old sub-string with the new sub-string.
You can use this pattern, too, with regex
:
import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
for k in a.split("\n"):
print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
# Or:
# final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
# print(final)
Output:
hello there A Z R T world welcome to python
this should the next line followed by an other million like this
Edit:
Otherwise, you can store the final lines into a list
:
final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)
Output:
['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With