I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.
My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!
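For concreteness, here is a minimal sketch of what I mean by cosine similarity over sparse term vectors (the whitespace tokenization is just a placeholder; real code would use a proper tokenizer and probably TF-IDF weighting):

```python
from collections import Counter
import math

def term_vector(text):
    # Placeholder tokenizer: lowercase and split on whitespace.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over one vector's terms, divided by the
    # product of the two Euclidean norms.
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```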
I have already considered the following ways to speed things up:
Pre-normalize all the term vectors, so that cosine similarity reduces to a plain dot product (see the sketch after this list)
Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!
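A rough sketch of both ideas combined, building on the `term_vector` helper above (`centroids` would be computed once, offline, for the ~10,000 groups; the group-to-e-mails mapping is a hypothetical input):

```python
from collections import Counter
import math

def term_vector(text):
    return Counter(text.lower().split())  # placeholder tokenizer

def normalize(vec):
    # Scale a sparse term vector to unit Euclidean length.
    norm = math.sqrt(sum(c * c for c in vec.values()))
    return {t: c / norm for t, c in vec.items()} if norm else {}

def group_centroid(emails):
    # Sum the term vectors of a group's e-mails, then normalize,
    # so scoring a query later is a single dot product per group.
    total = Counter()
    for text in emails:
        total.update(term_vector(text))
    return normalize(total)

def best_group(new_email, centroids):
    # centroids: {group_name: pre-normalized centroid vector}.
    # Since both sides are unit-length, the dot product IS the
    # cosine similarity.
    query = normalize(term_vector(new_email))
    return max(centroids, key=lambda g: sum(
        w * centroids[g].get(t, 0.0) for t, w in query.items()))
```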
I have a few questions:
If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.
Is there some clever way to only consider vectors which are likely to be close to the query document?
Is there some way to be more frugal about the amount of memory I'm using for all these vectors?
Thanks!
Use Bayesian filtering. It's usually presented in the context of spam filtering, but you can adapt the algorithm pretty easily to multiple categories/tags.
There are lots of good SO questions about Bayesian filtering, too.
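As a minimal sketch of what that could look like, here is multinomial naive Bayes via scikit-learn (my choice of library, not something the question specifies; the training data is obviously a placeholder for your labeled e-mails):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data: (e-mail body, folder/tag) pairs.
emails = ["quarterly report attached", "lunch on friday?",
          "server outage postmortem"]
labels = ["work", "social", "ops"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # sparse term-count matrix

clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new e-mail into the most probable group.
new = vectorizer.transform(["the report for this quarter"])
print(clf.predict(new))                # e.g. ['work']
```

A side benefit relevant to your question 1: terms never seen during training are simply dropped at prediction time (`transform` ignores out-of-vocabulary words), so nothing needs to be re-computed when a new e-mail introduces a new term.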