Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest C++ Container: Unique Values

I am writing an email application that interfaces with a MySQL database. I have two tables that are sourcing my data, one of which contains unsubscriptions, the other of which is a standard user table. As of now, I'm creating a vector of pointers to email objects, and storing all of the unsubscribed emails in it, initially. I then have a standard SQL loop in which I'm checking to see if the email is not in the unsubscribe vector, then adding it to the global send email vector. My question, is, is there a more efficient way of doing this? I have to search the unsub vector for every single email in my system, up to 50K different ones. Is there a better structure for searching? And, a better structure for maintaining a unique collection of values? Perhaps one that would simply discard the value if it already contains it?

like image 561
Josh Avatar asked Jan 11 '11 16:01

Josh


4 Answers

If your C++ Standard Library implementation supports it, consider using a std::unordered_set or a std::hash_set.

You can also use std::set, though its overhead might be higher (it depends on the cost of generating a hash for the object versus the cost of comparing two of the objects several times).

If you do use a node based container like set or unordered_set, you also get the advantage that removal of elements is relatively cheap compared to removal from a vector.

like image 163
James McNellis Avatar answered Oct 26 '22 08:10

James McNellis


  1. Tasks like this (set manipulations) are better left to what is MEANT to execute them - the database!

    E.g. something along the lines of:

     SELECT email FROM all_emails_table e WHERE NOT EXISTS (
         SELECT 1 FROM unsubscribed u where e.email=u.email
     )
    
  2. If you want an ALGORITHM, you can do this fast by retrieving both the list of emails AND a list of unsubscriptions as ORDERED lists. Then you can go through the e-mail list (which is ordered), and as you do it you glide along the unsubscribe list. The idea is that you move 1 forward in whichever list has the "biggest" current" element. This algo is O(M+N) instead of O(M*N) like your current one

  3. Or, you can do a hash map which maps from unsubscribed e-mail address to 1. Then you do find() calls on that map whcih for correct hash implementations are O(1) for each lookup. Unfortunately, there's no Hash Map standard in C++ - please see this SO question for existing implementations (couple of ideas there are SGI's STL hash_map and Boost and/or TR1 std::tr1::unordered_map).

    One of the comments on that post indicates it will be added to the standard: "With this in mind, the C++ Standard Library Technical Report introduced the unordered associative containers, which are implemented using hash tables, and they have now been added to the Working Draft of the C++ Standard."

like image 25
DVK Avatar answered Oct 26 '22 08:10

DVK


Store your email adresses in a std::set or use std::set_difference().

like image 4
Sven Marnach Avatar answered Oct 26 '22 09:10

Sven Marnach


The best way to do this is within MySQL, I think. You can modify your users table schema with another column, a BIT column, for "is unsubscribed". Better yet: add a DATETIME column for "date deleted" with a default value of NULL.

If using a BIT column, your query becomes something like:

SELECT * FROM `users` WHERE `unsubscribed` <> 0b1;

If using a DATETIME column, your query becomes something like:

SELECT * FROM `users` WHERE `date_unsubscribed` IS NULL;
like image 1
Daniel Trebbien Avatar answered Oct 26 '22 07:10

Daniel Trebbien