I use the code below to remove all HTML tags from a file and convert it to plain text. I also have to convert XML/HTML character entities to ASCII characters. The code has 21 lines that each scan the whole text, which means that converting a huge file consumes a lot of resources.
Do you have any ideas for making the code faster while reducing its resource usage?
# -*- coding: utf-8 -*-
import re
# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()
# Replace Some XML/HTML characters to ASCII ones.
temp = temp.replace ('&lsquo;',"""'""")
temp = temp.replace ('&rsquo;',"""'""")
temp = temp.replace ('&ldquo;',"""\"""")
temp = temp.replace ('&rdquo;',"""\"""")
temp = temp.replace ('&sbquo;',""",""")
temp = temp.replace ('&prime;',"""'""")
temp = temp.replace ('&Prime;',"""\"""")
temp = temp.replace ('&laquo;',"""«""")
temp = temp.replace ('&raquo;',"""»""")
temp = temp.replace ('&lsaquo;',"""‹""")
temp = temp.replace ('&rsaquo;',"""›""")
temp = temp.replace ('&amp;',"""&""")
temp = temp.replace ('&ndash;',""" – """)
temp = temp.replace ('&mdash;',""" — """)
temp = temp.replace ('&reg;',"""®""")
temp = temp.replace ('&copy;',"""©""")
temp = temp.replace ('&trade;',"""™""")
temp = temp.replace ('&para;',"""¶""")
temp = temp.replace ('&bull;',"""•""")
temp = temp.replace ('&middot;',"""·""")
# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)
# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
You can use str.translate():
# Python 3: str.maketrans() builds the translation table
# (in Python 2 this was string.maketrans() from the string module).
intab = "aeiou"    # example: the characters that need to be replaced
outtab = "12345"   # example: the new characters; must be the same length as intab
trantab = str.maketrans(intab, outtab)
text = "this is string example....wow!!!"  # your string
print(text.translate(trantab))  # th3s 3s str3ng 2x1mpl2....w4w!!!
Note that in Python 3, str.translate() will be significantly slower than in Python 2, especially if you translate only a few characters. This is because it must handle Unicode characters and therefore uses a dict to perform the translations instead of indexing into a string.
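To illustrate the dict-based translation mentioned in this comment, here is a minimal Python 3 sketch. The mapping and the sample string are made up for illustration and are not taken from the question. It relies on str.maketrans() accepting a dict whose keys are single characters and whose values may be longer strings.

```python
# Illustrative mapping only: typographic characters -> ASCII stand-ins.
table = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": " - ",  # em dash; one character may map to several
})

sample = "\u201cIt\u2019s done\u201d\u2014ok"
cleaned = sample.translate(table)  # one pass over the string
print(cleaned)  # "It's done" - ok
```

Keep in mind that the keys must be single characters, so this handles already-decoded characters but not multi-character entities such as &amp;.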
My first instinct is str.translate() in combination with str.maketrans() (string.translate() and string.maketrans() in Python 2). This makes only one pass instead of several: each call to str.replace() does its own pass over the entire string, and you want to avoid that.
An example:
from string import ascii_lowercase
from_str = ascii_lowercase
to_str = from_str[-1] + from_str[:-1]  # shift each letter back by one: a->z, b->a, ...
foo = 'the quick brown fox jumps over the lazy dog.'
# Python 3 spelling; in Python 2 use string.translate(foo, string.maketrans(from_str, to_str)).
bar = foo.translate(str.maketrans(from_str, to_str))
print(bar)  # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
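Since translate() maps single characters, it cannot match multi-character entities such as &amp; directly. One possible way to keep the single-pass idea for entities is a compiled regex with a dict lookup, sketched below for Python 3. The names ENTITY_MAP, ENTITY_RE and replace_entities are hypothetical, and the mapping is only a small illustrative subset. The standard library's html.unescape() is shown as an alternative when decoding every named entity to its Unicode character is acceptable.

```python
import re
from html import unescape  # standard library, Python 3.4+

# Hypothetical subset of the entities from the question, for illustration.
ENTITY_MAP = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
    "&mdash;": "-",
}

# One alternation pattern, so the text is scanned once for all entities.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITY_MAP))

def replace_entities(text):
    """Replace every known entity in a single pass over text."""
    return ENTITY_RE.sub(lambda match: ENTITY_MAP[match.group(0)], text)

sample = "&ldquo;Tom &amp; Jerry&rdquo; &mdash; it&rsquo;s on"
converted = replace_entities(sample)
print(converted)  # "Tom & Jerry" - it's on

# Alternative: decode all named entities to their Unicode characters.
print(unescape("&copy; 2020 &amp; more"))  # © 2020 & more
```

Whether this is actually faster than chained replace() calls depends on the input and the Python version, so it is worth timing both on a representative file.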