Replace several words in a text with Python

Question

I use the code below to remove all HTML tags from a file and convert it to plain text. I also have to convert XML/HTML character entities to ASCII characters. The code has 21 lines that each read through the whole text, which means that converting a huge file takes a lot of resources.

Do you have any ideas for making the code more efficient, so that it runs faster and uses fewer resources?

# -*- coding: utf-8 -*-
import re

# This file contains HTML. The with block closes it after reading.
with open('input-file.html', 'r') as file:
    temp = file.read()

# Replace some XML/HTML character entities with plain characters.
temp = temp.replace ('&lsquo;',"""'""")
temp = temp.replace ('&rsquo;',"""'""")
temp = temp.replace ('&ldquo;',"""\"""")
temp = temp.replace ('&rdquo;',"""\"""")
temp = temp.replace ('&sbquo;',""",""")
temp = temp.replace ('&prime;',"""'""")
temp = temp.replace ('&Prime;',"""\"""")
temp = temp.replace ('&laquo;',"""«""")
temp = temp.replace ('&raquo;',"""»""")
temp = temp.replace ('&lsaquo;',"""‹""")
temp = temp.replace ('&rsaquo;',"""›""")
temp = temp.replace ('&amp;',"""&""")
temp = temp.replace ('&ndash;',""" – """)
temp = temp.replace ('&mdash;',""" — """)
temp = temp.replace ('&reg;',"""®""")
temp = temp.replace ('&copy;',"""©""")
temp = temp.replace ('&trade;',"""™""")
temp = temp.replace ('&para;',"""¶""")
temp = temp.replace ('&bull;',"""•""")
temp = temp.replace ('&middot;',"""·""")

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as file:
    file.write(result)

avinash pandey · Accepted Answer

You can use translate() with a translation table built by maketrans():

# Python 3: maketrans() is a static method of str (Python 2 used string.maketrans).
intab = "aeiou"    # example characters to replace
outtab = "12345"   # example replacements; must be the same length as intab
trantab = str.maketrans(intab, outtab)  # build the translation table

text = "this is string example....wow!!!"  # your string
print(text.translate(trantab))

Note that in Python 3, str.translate() can be significantly slower than in Python 2, especially if you translate only a few characters. This is because it must handle Unicode characters, so it uses a dict to perform the translation instead of indexing into a string.
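In Python 3 the translation table can also be built from a dict, which lets a single character map to any replacement string. Here is a minimal sketch, assuming the entities are first decoded to Unicode with the standard library's html.unescape() (a swapped-in step that this answer does not mention). The table contents and the names ASCII_PUNCTUATION and to_ascii_punctuation are illustrative only.

```python
import html

# Hypothetical mapping of typographic characters to ASCII stand-ins,
# mirroring the first few replacements in the question.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quote  (&lsquo;)
    "\u2019": "'",   # right single quote (&rsquo;)
    "\u201c": '"',   # left double quote  (&ldquo;)
    "\u201d": '"',   # right double quote (&rdquo;)
    "\u201a": ",",   # single low-9 quote (&sbquo;)
    "\u2032": "'",   # prime              (&prime;)
    "\u2033": '"',   # double prime       (&Prime;)
})


def to_ascii_punctuation(text):
    """Decode HTML entities, then map typographic punctuation to ASCII."""
    decoded = html.unescape(text)  # '&lsquo;' -> '\u2018', '&amp;' -> '&', ...
    return decoded.translate(ASCII_PUNCTUATION)  # one pass over the decoded text


print(to_ascii_punctuation("&ldquo;Tom&rsquo;s&rdquo; &amp; 5&prime;"))
# "Tom's" & 5'
```

Characters that are not in the table, such as © or ™, pass through unchanged.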

Shashank · Answer

My first instinct is translate() in combination with maketrans(). This makes only one pass over the text instead of several. Each call to str.replace() makes its own pass over the entire string, and you want to avoid that.

An example:

# Python 3 version; Python 2 used string.maketrans() and string.translate().
from string import ascii_lowercase

from_str = ascii_lowercase
to_str = from_str[-1] + from_str[:-1]  # alphabet rotated right by one: 'zabc...xy'
foo = 'the quick brown fox jumps over the lazy dog.'
bar = foo.translate(str.maketrans(from_str, to_str))
print(bar)  # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
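The single-pass idea can also be applied to multi-character entities such as &lsquo;, which translate() cannot match because it maps one character at a time. One possible sketch uses a compiled regular expression with a dict lookup instead of translate(). The ENTITIES table is a subset of the question's list, and the names ENTITY_RE and replace_entities are made up for this example.

```python
import re

# Subset of the question's entity table; extend as needed.
ENTITIES = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
    "&ndash;": " \u2013 ",
    "&mdash;": " \u2014 ",
    "&copy;": "\u00a9",
}

# One alternation that matches any key; re.escape guards against
# regex metacharacters in the keys.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITIES))


def replace_entities(text):
    """Replace every known entity in a single scan of the text."""
    return ENTITY_RE.sub(lambda match: ENTITIES[match.group(0)], text)


print(replace_entities("&ldquo;A &amp; B&rdquo; &copy; 2015"))
# "A & B" © 2015
```

Because the text is scanned once, replaced output is never scanned again: "&amp;ndash;" becomes "&ndash;" rather than being decoded twice, which can happen with chained replace() calls. Calling replace_entities() on each line as the file is read would also avoid holding the whole document in memory.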

Tags: performance, python, unicode, processing-efficiency

Alin
