We have an email service that hosts close to 10000 domains such that we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I had implemented a SQL server based search system were I would parse the message and store all the words in a database table along with the memberid and the messageid. The database was on a separate server to the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day. We are a small company with limited resources with respect to servers, maintenance etc.
Thanks
have a look at Hadoop. It's complete "map-reduce" framework for working with huge datasets inspired by Google. It think (but I could be wrong) Rackspace is using it for email search for their clients.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With