I'm struggling again to improve the execution time of this piece of code. Since the calculations are really time-consuming I think that the best solution would be to parallelize the code.
I was first working with maps as explained in this question, but then I tried a simpler approach, thinking I could find a better solution. I haven't been able to come up with anything yet, though, and since it's a different problem I decided to post it as a new question.
I am working on a Windows platform, using Python 3.4.
Here's the code:
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]
for i in range(0, word_count):
    for j in range(0, word_count):
        if i > j:
            similarity = calculate_similarity(t_matrix[i], t_matrix[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
This is the calculate_similarity function:
def calculate_similarity(array_word1, array_word2):
    denominator = sum([array_word1[i] + array_word2[i] for i in range(word_count)])
    if denominator == 0:
        return 0
    numerator = sum([2 * min(array_word1[i], array_word2[i]) for i in range(word_count)])
    return numerator / denominator
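As an aside, the index-based comprehensions build full temporary lists and index into both rows element by element. A minimal sketch of an equivalent version using zip, assuming both rows have the same length (word_count):

def calculate_similarity(array_word1, array_word2):
    # Pair up corresponding entries once instead of indexing twice per element.
    denominator = sum(a + b for a, b in zip(array_word1, array_word2))
    if denominator == 0:
        return 0
    numerator = sum(2 * min(a, b) for a, b in zip(array_word1, array_word2))
    return numerator / denominator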
And the explanation for the code:
- word_count is the total number of unique words, stored in a list
- t_matrix is a matrix containing a value for each pair of words
- similarity_matrix, whose dimension is word_count x word_count, also contains a similarity value for each pair of words
- calculate_similarity takes two float lists, each for a separate word (each is a row in the t_matrix)

I work with a list of 13k words, and if I calculated correctly, the execution time on my system would be a few days. So, anything that will do the job in one day would be wonderful!
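For scale: 13,000 words give 13000 × 12999 / 2 ≈ 84.5 million unique pairs, and each calculate_similarity call loops over word_count elements twice, so the total work is on the order of 10^12 basic operations, which is consistent with a multi-day estimate in pure Python.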
Maybe only parallelizing the calculation of numerator and denominator in calculate_similarity would make a significant improvement.
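One thing worth noting before parallelizing: the if i > j check skips half of the word_count × word_count iterations, and the surviving pairs can be generated directly with itertools.combinations, which is also how both answers below enumerate the work. A minimal sketch of the same loop in that style:

from itertools import combinations

similarity_matrix = [[0] * word_count for _ in range(word_count)]
for i, j in combinations(range(word_count), 2):
    similarity = calculate_similarity(t_matrix[i], t_matrix[j])
    similarity_matrix[i][j] = similarity_matrix[j][i] = similarity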
Here's an alternative implementation of the same general algorithm as in Matt's answer, just using multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor. It may be more efficient than his code, since the values of the input (t_matrix) are only serialized once and passed to the initializer function in each worker process.
import multiprocessing
import itertools

def worker_init(matrix):
    # Runs once in every worker process: the matrix is pickled and sent
    # to each worker a single time instead of once per task.
    global worker_matrix
    worker_matrix = matrix

def worker(i, j):
    similarity = calculate_similarity(worker_matrix[i], worker_matrix[j])
    return i, j, similarity

def main(matrix):
    size = len(matrix)
    result = [[0] * size for _ in range(size)]
    with multiprocessing.Pool(initializer=worker_init, initargs=(matrix,)) as pool:
        # Only pairs with i < j are computed; symmetry fills in the rest.
        for i, j, val in pool.starmap(worker, itertools.combinations(range(size), 2)):
            result[i][j] = result[j][i] = val
    return result

if __name__ == "__main__":
    # get t_matrix from somewhere
    similarity_matrix = main(t_matrix)
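A usage note: with 13k words, starmap receives roughly 84.5 million argument tuples, so per-task dispatch overhead matters. Pool.starmap accepts a chunksize argument that batches many tuples into each message to a worker; the value below is only an illustrative starting point to tune, not a recommendation:

# Inside main(): larger chunks mean fewer, bigger IPC messages.
pairs = itertools.combinations(range(size), 2)
for i, j, val in pool.starmap(worker, pairs, chunksize=10000):
    result[i][j] = result[j][i] = val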
from concurrent.futures import ProcessPoolExecutor, wait
from itertools import combinations
from functools import partial

similarity_matrix = [[0] * word_count for _ in range(word_count)]

def callback(i, j, future):
    # Runs in the main process when a task finishes; one result fills
    # both halves of the symmetric matrix.
    similarity = future.result()
    similarity_matrix[i][j] = similarity
    similarity_matrix[j][i] = similarity

with ProcessPoolExecutor(max_workers=4) as executor:
    fs = []
    for i, j in combinations(range(word_count), 2):
        future = executor.submit(
            calculate_similarity,
            t_matrix[i],
            t_matrix[j])
        future.add_done_callback(partial(callback, i, j))
        fs.append(future)
    wait(fs)
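The same scaling concern applies here, amplified: one Future per pair means tens of millions of Future objects and callbacks in the main process. From Python 3.5 onward (the question targets 3.4), Executor.map accepts a chunksize parameter that batches tasks much like Pool.starmap above. A hedged sketch under that assumption, where worker_pair is a hypothetical helper introduced for illustration:

def worker_pair(args):
    # Each work item carries its own rows, as the submit() calls above do.
    i, j, row_i, row_j = args
    return i, j, calculate_similarity(row_i, row_j)

with ProcessPoolExecutor(max_workers=4) as executor:
    work = ((i, j, t_matrix[i], t_matrix[j])
            for i, j in combinations(range(word_count), 2))
    # chunksize batches many pairs per task, cutting dispatch overhead.
    for i, j, val in executor.map(worker_pair, work, chunksize=10000):
        similarity_matrix[i][j] = similarity_matrix[j][i] = val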