Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to search keyword in 100 billions of posts?

This is a college project:

I have a database ( mysql or postgresql doesn't matter ) with 100 billion of posts and I need to search ( as fast as possible ) a generic keyword.

Every post is 500-1000 keywords.

This isn't only a database issue but also a software ( for indexing or other ) issue.

How can I do that ?

I could use some advanced search engine technology but I don't know which.

like image 673
xRobot Avatar asked Sep 04 '10 12:09

xRobot


4 Answers

Have you considered using Apache Lucene ?

That does not directly work directly with your SQL database, you'll have to write some code to feed the documents into it in order to build and index which you then can query.

I don't know however how much additional space that would require and how much time it will take.

like image 96
Andre Holzner Avatar answered Nov 10 '22 16:11

Andre Holzner


Sell the "100 billion" post data to Google. :)

They will index it for you free and you will make money.

like image 40
2 revs Avatar answered Nov 10 '22 16:11

2 revs


Well, there are about 6.8 billion people on the planet, who can read about 1 post per minute (on average).

If everyone contributes, 100 billion divided by 6.8 billion is 14.7 minutes to have all posts read.

So:

1) Conquer Earth.
2) Make everyone your slave.
3) Have posts read.
4) ???
5) Profit!

like image 5
Amy B Avatar answered Nov 10 '22 16:11

Amy B


You might want to check out Sphinx. It's a full text search engine that handles distributed indexes. You can have parts of the data spread over many computers. And querying a single server can send the query to the other servers, and gather the results from each. It has pretty good speed, but you probably can't do 100 billion posts on a single machine.

You probably won't be able to do something like this in MySQL or Postgresql. While you could store all the data, MySQL and Postgres lack the full text indexing and search speed that a real full text index will bring you.

MySQL has a way to compile in support for the Sphinx storage engine, which, although the data will still be stored in Sphinx separate from MySQL, you can still query the Sphinx search engine using anything that connects to MySQL, and also perform joins to other tables that are in your MySQL database. However, if you just want to perform simple searches on the documents, and don't require joining to other data, you can just use the native PHP interface.

like image 4
Kibbee Avatar answered Nov 10 '22 14:11

Kibbee