I have a dictionary file that contains a word in each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tab in each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace all of the words by id if it exists in the dictionary, so that the output looks like,
3454 2345 123 5436 322 ....
So I wrote such python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
l = l.replace("\n", "")
titlemap[l.lower()] = nr
nr+=1
fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
tokens = l.split("\t")
if tokens[0] in titlemap.keys():
fw.write(str(titlemap[tokens[0]]) + "\t")
for t in tokens[1:]:
if t in titlemap.keys():
fw.write(str(titlemap[t]) + "\t")
fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, so it makes me suspicious if I have done everything right.
Is this an efficient way to do this?
The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough)
tokens = l.split("\t")
fw.write('\t'.join(fw.write(str(titlemap[t])) for t in tokens if t in titlemap)
fw.write("\n")
or even:
lines = []
for l in f:
lines.append('\t'.join(fw.write(str(titlemap[t])) for t in l.split('\t') if t in titlemap)
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting them to string when you read then:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With