We are working on an information retrieval task and need to rank research papers against a query.
After cleaning the data and creating a dataframe, we tokenized the paper texts and need to save the result to a file.
import csv
import sys

from rank_bm25 import BM25Okapi

# Tokenize the first 40,000 documents in two slices; the third slice is disabled.
# tokenized_corpus = [doc.split(" ") for doc in corpus]
corpus = list(df.body_text)
tokenized_corpus1 = [doc.split(" ") for doc in corpus[:20000]]
tokenized_corpus2 = [doc.split(" ") for doc in corpus[20000:40000]]
# tokenized_corpus3 = [doc.split(" ") for doc in corpus[40000:]]
tokenized_corpus = tokenized_corpus1 + tokenized_corpus2  # + tokenized_corpus3
The cell above creates the tokenized corpus.
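One detail worth checking in the tokenization step: `split(" ")` splits on single spaces only, so repeated spaces produce empty tokens and newlines stay inside tokens. The sketch below compares it with a whitespace-based split. The `tokenize` helper and the lowercasing are assumptions for illustration, not part of the original code; whichever tokenizer is used for the corpus should also be used for the query.

```python
def tokenize(text):
    """Lowercase the text and split on any run of whitespace (hypothetical helper)."""
    return text.lower().split()


sample = "Coronavirus  origin\nand spread"

# Splitting on a single space keeps an empty token and an embedded newline.
naive_tokens = sample.split(" ")

# Splitting on whitespace runs avoids both problems.
clean_tokens = tokenize(sample)

print(naive_tokens)
print(clean_tokens)
```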
with open('file.csv', 'w', newline='', encoding="utf-8") as f:
    writer = csv.writer(f)
    # Each document becomes one row, with one token per field.
    writer.writerows(tokenized_corpus)
Then we save the tokenized data to a .csv file.
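For completeness, here is a minimal sketch of reading such a file back, so the tokenization does not have to be repeated. The helper names `save_tokens` and `load_tokens` and the sample rows are made up for illustration; the sketch writes to a temporary directory rather than `file.csv`.

```python
import csv
import os
import tempfile


def save_tokens(rows, path):
    """Write one document per row, one token per field."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def load_tokens(path):
    """Read the rows back as a list of token lists."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f)]


# Illustrative data; the second row has a token containing a comma,
# which the csv module quotes on write and unquotes on read.
sample = [["coronavirus", "origin"], ["bat, pangolin", "host"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.csv")
    save_tokens(sample, path)
    restored = load_tokens(path)

print(restored)
```

Note that a document with no tokens is written as a blank line, so it is worth checking that the number of rows read back matches the number of documents.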
After that, we build the index by calling the BM25Okapi constructor:
bm25 = BM25Okapi(tokenized_corpus)
Because this step takes a long time and consumes gigabytes of memory (causing frequent errors), we want to save the result so that we do not need to call the function again every time.
To score the documents against a query, we used the following steps.
query = "coronavirus origin"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
doc_scores
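The scores come back in corpus order, one per document, so ranking means sorting document indices by score. A minimal sketch, assuming `doc_scores` behaves like a sequence of numbers; the helper `top_n_indices` and the example scores are made up for illustration:

```python
def top_n_indices(scores, n=5):
    """Return the indices of the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Illustrative scores for a four-document corpus.
example_scores = [0.0, 2.3, 0.7, 4.1]
best = top_n_indices(example_scores, n=2)

print(best)
```

The returned indices can then be used to look up the matching rows in the dataframe.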
I was not able to save the BM25 object to a file, and I did not see any method for this in the source code. How should I do it?
The question was asked in the wrong way. What we need is a way to save a Python object in general, not specifically the BM25Okapi results.
So, here is the solution:
import pickle

# Save the bm25 object
with open('bm25result', 'wb') as bm25result_file:
    pickle.dump(bm25, bm25result_file)

Then, to read the object back:

# Load the bm25 object
with open('bm25result', 'rb') as bm25result_file:
    bm25result = pickle.load(bm25result_file)
A detailed description can be found in this article.
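The save and load steps can be combined into a "load if cached, otherwise build" pattern, so the expensive step runs only once. This is a sketch under assumptions: `load_or_build` and `build_index` are hypothetical names, and a plain dict stands in for the BM25 object so the example runs without the corpus.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the pickled object at path, building and saving it first if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return obj


calls = []


def build_index():
    """Stand-in for the expensive step, e.g. BM25Okapi(tokenized_corpus)."""
    calls.append(1)
    return {"doc_freqs": [{"coronavirus": 2}], "avgdl": 3.5}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "index.pkl")
    first = load_or_build(cache_path, build_index)   # builds and saves
    second = load_or_build(cache_path, build_index)  # loads from disk

print(len(calls))
```

Two caveats apply to pickle in general: only unpickle files you trust, since loading can execute arbitrary code, and a pickled object may fail to load if the class that defines it changes between library versions.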