I'm doing an Information Retrieval task and built a simple search engine. The InvertedIndex is a Python dictionary that is serialized (pickled, in Python terminology) to a file; the file is just 6.5 MB.
My code simply unpickles the index, searches it for the query, and ranks the matching documents by TF-IDF score. That doesn't sound like much, right?
It started running 30 minutes ago and is still running. Private Bytes and Virtual Size usage of the pythonw.exe process
running my 100-line Python script are 88 MB and 168 MB respectively.
When I tried it with a smaller index it was fast. Is it Python or my code? Why is it so slow?
stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
import PorterStemmer
import math
import pickle
def TF(term,doc):
#Term Frequency: number of times `term` occurred in `doc`
global InvertedIndex
idx = InvertedIndex[term].index(doc)
count = 0
while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
count= count+1
idx = idx+1
return count
def DF(term):
#Document Frequency: No. of documents containing `term`
global InvertedIndex
return len(set(InvertedIndex[term]))
def avgTF(term, doc):
global docs
TFs = []
for term in docs[doc]:
TFs.append(TF(term,doc))
return sum(TFs)/len(TFs)
def maxTF(term, doc):
global docs
TFs = []
for term in docs[doc]:
TFs.append(TF(term,doc))
return max(TFs)
def getValues4Term(term, doc):
TermFrequency = {}
TermFrequency['natural'] = TF(term,doc)
TermFrequency['log'] = 1+math.log( TF(term,doc) )
TermFrequency['aug'] = 0.5+float(0.5*TF(term,doc)/maxTF(term,doc))
TermFrequency['bool'] = 1 if TF(term,doc)>0 else 0
TermFrequency['log_avg'] = float(1+math.log( TF(term,doc) ))/(1+math.log( avgTF(term,doc) ))
DocumentFrequency = {}
DocumentFrequency['no'] = 1
DocumentFrequency['idf'] = math.log( len(docs)/DF(term) )
DocumentFrequency['probIDF'] = max( [0, math.log( float(len(docs)-DF(term))/DF(term) )] )
return [TermFrequency, DocumentFrequency]
def Cosine(resultDocVector, qVector, doc):
#`doc` parameter is the document number corresponding to resultDocVector
global qterms,docs
# Defining Cosine similarity : cos(a) = A.B/|A||B|
dotProduct = 0
commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
for cmnTerm in commonTerms_q_d:
dotProduct = dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]
resultSquares = []
for k in resultDocVector:
resultSquares.append(k*k)
qSquares = []
for k in qVector:
qSquares.append(k*k)
denominator = math.sqrt(sum(resultSquares)) * math.sqrt(sum(qSquares))
return dotProduct/denominator
def load():
#load index from a file
global InvertedIndex, docIDs, docs
PICKLE_InvertedIndex_FILE = open("InvertedIndex.db", 'rb')
InvertedIndex = pickle.load(PICKLE_InvertedIndex_FILE)
PICKLE_InvertedIndex_FILE.close()
PICKLE_docIDs_FILE = open("docIDs.db", 'rb')
docIDs = pickle.load(PICKLE_docIDs_FILE)
PICKLE_docIDs_FILE.close()
PICKLE_docs_FILE = open("docs.db", 'rb')
docs = pickle.load(PICKLE_docs_FILE)
PICKLE_docs_FILE.close()
########################
docs = []
docIDs = []
InvertedIndex = {}
load()
stemmer = PorterStemmer.PorterStemmer()
#<getting results for a query
query = 'Antarctica exploration'
qwords = query.strip().split()
qterms = []
qterms1 = []
for qword in qwords:
qword = qword.lower()
if qword in stopwords:
continue
qterm = stemmer.stem(qword,0,len(qword)-1)
qterms1.append(qterm)
qterms = list(set(qterms1))
#getting posting lists for each qterm & merging them
prev = set()
i = 0
for qterm in qterms:
if InvertedIndex.has_key(qterm):
if i == 0:
prev = set(InvertedIndex[qterm])
i = i+1
continue
prev = prev.intersection(set(InvertedIndex[qterm]))
results = list(prev)
#</getting results for a query
#We've got the results. Now let's rank them using Cosine similarity.
i = 0
docComponents = []
for doc in results:
docComponents.append([])
i = 0
for doc in results:
for term in docs[doc]:
vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
docComponents[i].append(vals)
i = i+1
#Normalization = {}
# forming vectors for each document in the result
i = 0 #document iterator
j = 0 #term iterator
resultDocVectors = []#contains document vector for each result.
for doc in results:
resultDocVectors.append([])
for i in range(0,len(results)):
for j in range(0,len(docs[doc])):
tf = docComponents[i][j][0]['natural']#0:TermFrequency
idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency
resultDocVectors[i].append(tf*idf)
#forming vector for query
qVector = []
qTF = []
qDF = []
for qterm in qterms:
count = 0
idx = qterms1.index(qterm)
while idx < len(qterms1) and qterms1[idx] == qterm:
count= count+1
idx = idx+1
qTF.append(count)
qVector = qTF
#computing Cosine similarities of all resultDocVectors w.r.t qVector
i = 0
CosineVals = []
for resultDocVector in resultDocVectors:
doc = results[i]
CosineVals.append(Cosine(resultDocVector, qVector, doc))
i = i+1
#ranking as per Cosine Similarities
#this is not "perfect" sorting, as it may not give 100% correct results when multiple docs have the same cosine similarity.
CosineValsCopy = CosineVals[:] #a real copy, not just another reference to the same list
CosineVals.sort()
sortedCosineVals = CosineVals
CosineVals = CosineValsCopy
rankedResults = []
for cval in sortedCosineVals:
rankedResults.append(results[CosineVals.index(cval)])
rankedResults.reverse()
#<Evaluation of the system:>
#parsing qrels.txt & getting relevances
# qrels.txt contains columns of the form:
# qid iter docno rel
#2nd column `iter` can be ignored.
relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
cols = line.strip().split()
if relevances.has_key(cols[0]):#queryID
relevances[cols[0]].append(cols[2])#docID
else:
relevances[cols[0]] = [cols[2]]
fh.close()
#precision = no. of relevant docs retrieved/total no. of docs retrieved
no_of_relevant_docs_retrieved = len( set(rankedResults).intersection( set(relevances[queryID]) ) )
Precision = float(no_of_relevant_docs_retrieved)/len(rankedResults)
#recall = no. of relevant docs retrieved/total no. of relevant docs
Recall = float(no_of_relevant_docs_retrieved)/len(relevances[queryID])
It's definitely your code, but since you chose to hide it from us it's impossible for us to help any further. All I can tell you, based on the very scarce info you chose to supply, is that unpickling a dict (in the right way) is much faster than that, and indexing into it (assuming that's what you mean by "searches it for query") is blazingly fast. From this data I deduce that the cause of your slowdown must be something else you do, or do wrong, in your code.
Edit: now that you have posted your code I notice, just at a glance, that a lot of nontrivial code is running at module top level. That's a really horrible practice, and detrimental to performance: put all your nontrivial code into a function and call that function -- that by itself will give you a tens-of-percent speedup, at zero cost in complication. I must have mentioned that crucial fact at least 20 times just in my Stack Overflow posts, not to mention "Python in a Nutshell" etc -- surely if you care about performance you cannot blithely ignore such easily available and widespread information?!
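As a minimal sketch of that advice applied to this script (search and rank below are hypothetical helpers standing in for the query-merging and cosine-ranking blocks currently sitting at module level; locals inside main are also cheaper to look up than module globals):

def main():
    load()                          #unpickle the index files into the module globals
    query = 'Antarctica exploration'
    results = search(query)         #hypothetical helper: stopword removal, stemming, posting-list merge
    ranked = rank(results, query)   #hypothetical helper: TF-IDF vectors + cosine ranking
    print(ranked)

if __name__ == '__main__':
    main()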
More easily-fixed wastes of runtime:
You import pickle; use cPickle instead.
(if you're not on Python 2.6 or 2.7, but rather on 3.1, then there may be other causes of performance issues -- I don't know how finely tuned 3.1 is at this time compared with the awesome performance of 2.6 and 2.7).
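On Python 2 the swap is a one-liner; a minimal sketch, with a fallback for interpreters that don't have cPickle (dumping with a binary protocol, e.g. protocol 2, also makes later loads faster):

try:
    import cPickle as pickle    #C implementation, much faster to load and dump
except ImportError:
    import pickle               #plain-Python fallback (on Python 3 pickle uses its C accelerator anyway)

def load_index(path='InvertedIndex.db'):
    #illustrative loader; the load() above would call pickle.load the same way
    f = open(path, 'rb')
    try:
        return pickle.load(f)
    finally:
        f.close()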
All of your global statements are useless except the one in load (not a serious performance hit, but redundant and useless code should be eliminated on principle). You only need a global statement if you want to bind a module-global variable from within a function, and load is the only function where you're doing that.
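To illustrate the rule (names here are purely illustrative, not from the script above):

counter = 0
cache = {}

def bump():
    global counter           #rebinding a module-level name: 'global' is required
    counter = counter + 1

def remember(key, value):
    cache[key] = value       #mutating in place: no 'global' needed

def peek(key):
    return cache.get(key)    #merely reading: no 'global' needed either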
More editing:
Now we get to more important stuff: the values in InvertedIndex appear to be lists of docs, so to know how many times a doc appears in one you have to loop through the whole list. Why not make each value a dict mapping doc to number of occurrences instead? No looping (and no looping where you now do the len(set(...)) of a value in InvertedIndex -- just the len will then be equivalent, and you save the set(...) operation, which implicitly has to loop to perform its job). This is a big-O optimization, not "just" the kind of 20%-or-so speedup that the things I've mentioned so far may account for -- i.e., this is the more important stuff, as I said. Use the right data structures and algorithms and a lot of minor inefficiencies may become relatively unimportant; use the wrong ones, and there's no saving your code's performance as the input size grows, no matter how cleverly you micro-optimize the wrong data structures and algorithms ;-).
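A minimal sketch of that restructuring (illustrative code, not the original; the conversion could be run once after loading, or the index builder could store counts directly), showing how TF and DF collapse to constant-time lookups:

#before: InvertedIndex['whale'] == [3, 3, 3, 7, 12, 12]   (one list entry per occurrence)
#after:  InvertedIndex['whale'] == {3: 3, 7: 1, 12: 2}    (doc -> occurrence count)
def to_counted_index(old_index):
    new_index = {}
    for term, posting in old_index.items():
        counts = {}
        for doc in posting:
            counts[doc] = counts.get(doc, 0) + 1
        new_index[term] = counts
    return new_index

def TF(term, doc):
    return InvertedIndex[term].get(doc, 0)   #O(1) lookup instead of scanning the posting list

def DF(term):
    return len(InvertedIndex[term])          #len of the dict; no set(...) pass needed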
And more: you repeat your computations a lot, "from scratch" each time -- e.g., look at how many times you call TF(term, doc) for each given term and doc (and each call has the crucial inefficiency I just explained, too). As the quickest way to fix this huge inefficiency, use memoization -- e.g., with the memoized decorator found here.
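In case the linked recipe isn't handy, a minimal sketch of such a decorator (not necessarily the exact one referenced) looks like this:

import functools

def memoized(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        #positional args must be hashable -- term and doc are, here
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

#usage: decorate the expensive functions, e.g.
#   @memoized
#   def TF(term, doc):
#       ...original body unchanged...
#so repeated calls with the same (term, doc) become single dict lookups.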
OK, it's getting late for me and I'd better head off to bed -- I hope some or all of the above suggestions were of some use to you!