How to optimize "text search" for inverted index and relational database? [closed]

Update 2020-02-18

I came across this question again and, although the accepted answer remains the same, I thought I'd share how I would optimize it nowadays (again without using third-party libraries or tools, i.e. re-inventing the wheel from scratch, as mentioned in the original question).

To simplify and optimize this system, I would use a Trie (prefix tree) in the domain-logic tier instead of the "inverted index maps", and completely discard the bad practice of the "Queries" SQL table. I will illustrate with an example:

  • Say some users of the app have already added a couple of objects to the database: w, wo, woo, wood and woodx.
  • Those objects (which are strings/labels) are represented by a Trie in memory; each Trie node contains the database-issued ID of the object ending at that node (think associative array).
  • When a user queries a word, we search for that word in the Trie and accumulate all relevant IDs on the way down, starting from the node of the searched word (i.e. a depth-first traversal of its subtree). From these IDs we retrieve all needed object data (whether from the cache or the DB).

Here is a picture to illustrate: Trie

  • Next, if a user adds a new word to the database, say for example "woodxe", the Trie is updated accordingly.
  • When a user then queries "woodx", the same process as before occurs, and one new ID is accumulated in addition ("woodxe"'s issued ID).
  • There is a finite list of words in the English dictionary that start with a specific sequence of letters, so moving down and collecting all sub-nodes is still bounded by a constant, i.e. O(1). For example, if you start at "wood" in the Trie, the number of sub-nodes having "wood" as a prefix in the English dictionary is a finite constant. Whether you return all these sub-nodes to the user, define a limit (lazy loading/paging), or only display the top 10 hits is a personal architectural preference.

Here is a picture to illustrate (check what is added in green) Trie - Continued

  • When the user's query is a multi-word string, e.g. "wood furniture", each word is parsed/added separately to the Trie, and each will accordingly have its own list of matching IDs.
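The scheme above could be sketched roughly as follows — a minimal, illustrative Trie that stores the database-issued ID at each word-end node; the class and method names here are my own, not from the original app:

```java
import java.util.*;

// Minimal prefix tree: each node that terminates a stored label
// holds the database-issued ID of that record.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        Long id; // null if no record ends exactly at this node
    }

    private final Node root = new Node();

    // Insert a label together with its database-issued ID.
    void insert(String word, long id) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.id = id;
    }

    // Walk down to the prefix node, then collect every ID in its subtree.
    List<Long> search(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return Collections.emptyList();
        }
        List<Long> ids = new ArrayList<>();
        collect(cur, ids);
        return ids;
    }

    // Depth-first traversal accumulating the IDs of all stored words below this node.
    private void collect(Node node, List<Long> ids) {
        if (node.id != null) ids.add(node.id);
        for (Node child : node.children.values()) collect(child, ids);
    }
}
```

With w, wo, woo, wood and woodx inserted with IDs 1–5, `search("wood")` accumulates the IDs of "wood" and "woodx"; after inserting "woodxe", `search("woodx")` additionally picks up its new ID.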

How does a *Trie* improve the previous architecture?

  • The "Queries" table, which was cumbersome, bad practice, and a big overhead that grew in direct proportion to the database, is now removed.
  • The "inverted index maps" we had created an extra memory overhead and could not easily be extended with new words (like the "woodx" example above). One could argue that querying a hashmap is O(1); however, keeping several large hashmaps in memory actually slows things down at a certain scale and is considered bad engineering design.
  • A Trie has a search complexity of O(m), where m is the number of characters in the searched word. Since the user queries are plain words over the English alphabet, the deepest path is bounded by the longest available English word (a constant, i.e. O(1)). Furthermore, as mentioned before, the number of sub-nodes starting with a given word prefix in the English dictionary is also a constant, so traversing all those combinations is O(1). So in total it is an O(1) operation.
    • So querying the Trie = getting a key from a hashmap = O(1), i.e. just as fast.
    • On top of that, the benefits of a Trie in this system are:
      • Smaller memory overhead than having multiple inverted index hashmaps in memory
      • A centralized querying tree
      • Easy extensibility: new words added to the database require just adding a couple of new nodes to the existing in-memory Trie. I.e., no more issues with the database growing and search queries increasing in number (a scalability nightmare).

Update 2015-10-15

Back in 2012, I was building a personal online application and actually wanted to re-invent the wheel, because I am curious by nature, for learning purposes, and to enhance my algorithm and architecture skills. I could have used Apache Lucene and others; however, as mentioned, I decided to build my own mini search engine.

Question: So is there really no way to enhance this architecture except by using available services like Elasticsearch, Lucene, and others?


Original question

I am developing a web application in which users search for specific titles (say, for example: book x, book y, etc.), whose data is in a relational database (MySQL).

I am following the principle that each record fetched from the DB is cached in memory, so that the app makes fewer calls to the database.

I have developed my own mini search engine, with the following architecture:
Architecture diagram
This is how it works:

  • a) The user searches for a record name.
  • b) The system checks which character the query starts with and whether the query is already known: if so, it gets the records directly. If not, it adds the query and gets all matching records from the database in one of two ways:
    • Either the query already exists in the "Queries" table (which is a sort of history table), in which case the records are fetched by their IDs (fast performance);
    • Or, otherwise, a MySQL LIKE %...% statement is used to get the records/IDs (and the user's query is then stored in the "Queries" history table along with the IDs it maps to).
      --> Then it adds the records and their IDs to the cache, and only the IDs to the inverted index map.
  • c) The results are returned to the UI.
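Step (b) could be sketched roughly like this — plain in-memory maps stand in for the cache and the "Queries" history table, and a linear scan stands in for the MySQL LIKE %...% fallback; all names here are illustrative, not the original code:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the lookup flow: query history first, LIKE-style scan as fallback.
class MiniSearch {
    // Stand-in for the "records" table: id -> title.
    final Map<Long, String> records = new HashMap<>();
    // Stand-in for the "Queries" history table: query -> matching record IDs.
    final Map<String, List<Long>> queryHistory = new HashMap<>();

    List<Long> search(String query) {
        // Fast path: the query was seen before, reuse its stored IDs.
        List<Long> known = queryHistory.get(query);
        if (known != null) return known;

        // Slow path: emulate SELECT id FROM records WHERE title LIKE '%query%'.
        List<Long> ids = records.entrySet().stream()
                .filter(e -> e.getValue().contains(query))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // Remember the mapping so the next identical query is a fast lookup.
        queryHistory.put(query, ids);
        return ids;
    }
}
```

The first search for a term pays the scan cost and records the result; every later identical search is a single map lookup, mirroring the history-table fast path described above.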

The system works fine; however, I have two main issues that I couldn't find a good solution for (I've been trying for the past month):

First issue:
If you check point (b), in the case where no query "history" is found and the LIKE %...% statement has to be used: this process becomes time-consuming when the query matches numerous records in the database (instead of one or two):

  • It takes some time to get the records from MySQL (this is why I used indexes on the specific columns)
  • Then time to save the query history
  • Then time to add the records/IDs to the cache and the inverted index maps

Second issue:
The application allows users to add new records themselves, which can immediately be used by other users logged into the application.
However, to achieve this, the inverted index map and the "Queries" table have to be updated so that any old query still matches the new word. For example, if a new record "woodX" is added, the old query "wood" should still map to it. So in order to re-hook the query "wood" to this new record, here is what I am doing now:

  • The new record "woodX" gets added to the "records" table.
  • Then I run a LIKE %...% statement to see which already-existing queries in the "Queries" table map to this record (for example "wood"), and add each such query with the new record's ID as a new row: [wood, new id].
  • Then, in memory, I update the inverted index map's "wood" key's value (i.e. the list) by adding the new record's ID to it.

--> Thus, if a remote user now searches for "wood", they will get from memory: wood and woodX.
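The re-hook step could look roughly like this — again with an in-memory map standing in for the "Queries" table / inverted index; in the real app the matching loop would be a LIKE query against the history table, and the names here are illustrative:

```java
import java.util.*;

// Sketch: when "woodX" is inserted, append its ID to every stored query it matches.
class IndexUpdater {
    // Stand-in for the "Queries" table / inverted index: query -> record IDs.
    final Map<String, List<Long>> invertedIndex = new HashMap<>();

    void addRecord(String title, long id) {
        // Emulate: SELECT query FROM queries WHERE :title LIKE CONCAT('%', query, '%')
        for (Map.Entry<String, List<Long>> e : invertedIndex.entrySet()) {
            if (!e.getKey().equals(title) && title.contains(e.getKey())) {
                e.getValue().add(id); // re-hook the old query to the new record
            }
        }
        // Also register the exact title as a query of its own.
        invertedIndex.computeIfAbsent(title, k -> new ArrayList<>()).add(id);
    }
}
```

So after `addRecord("woodX", …)`, the pre-existing "wood" entry also lists the new ID, which is exactly the behavior described above — and also shows why the cost grows with the number of stored queries.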

The issue here is also time consumption. Matching all query histories (in the "Queries" table) against the newly added word takes a lot of time (the more matching queries, the more time). The in-memory update also takes a lot of time.

What I am thinking of doing to fix this time issue is to return the desired results to the user first, then let the application send an AJAX POST with the required data to perform all these update tasks in the background. But I am not sure whether this is bad practice or an unprofessional way of doing things.
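Deferring the bookkeeping until after the response has been sent is a common pattern on the server side too; here is a minimal sketch using an ExecutorService (just an illustration of the idea — the servlet/AJAX wiring and error handling are omitted):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: answer the user first, run index/history updates on a background thread.
class DeferredIndexer {
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    final Map<String, List<Long>> invertedIndex = new ConcurrentHashMap<>();

    // Returns immediately; the expensive update happens asynchronously.
    Future<?> scheduleUpdate(String query, List<Long> ids) {
        return background.submit(() ->
                invertedIndex.computeIfAbsent(query, k -> new ArrayList<>()).addAll(ids));
    }

    void shutdown() { background.shutdown(); }
}
```

The caller can return results to the UI right away and let the single worker thread serialize the index/history writes, which also avoids concurrent modification of the shared maps.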
So for the past month (a bit more) I have tried to think of the best optimization/modification/update for this architecture, but I am not an expert in the document-retrieval field (it's actually the first mini search engine I've ever built).

I would appreciate any feedback or guidance on what I should do to achieve this kind of architecture.
Thanks in advance.

PS:

  • It's a J2EE application using servlets.
  • I am using MySQL InnoDB (thus I cannot use the full-text search option).

ccot asked May 30 '12


2 Answers

I would strongly recommend Sphinx Search Server, which is well optimized for full-text searching. Visit http://sphinxsearch.com/.

It's designed to work with MySQL, so it's an addition to your current stack.

ad4s answered Oct 14 '22

I do not pretend to have THE solution, but here are my ideas. First, I thought like you about the time-consuming LIKE %...% queries: I would execute a query limited to a few answers in MySQL, like a dozen, return that to the user, and then either wait to see whether the user wants more matching records, or launch the full query in the background, depending on your indexing needs for future searches.

More generally, I think that storing everything in memory could one day lead to excessive memory consumption. And although the search engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up to date whenever data is added or updated, and that will certainly take more and more time.

That's why I think the solution I once saw in an open-source forum software (I can't remember its name) is not too bad for text searching in posts: each time data is inserted, a table named "Words" keeps track of every existing word, and another table (let's say "WordsLinks") keeps the link between each word and the posts it appears in. This kind of solution has some drawbacks:

  • Each INSERT, DELETE, and UPDATE in the database is a lot slower.
  • The data selected for the search engine must be anticipated: if you decide to start keeping two-letter words that you never kept before, it is too late for already-recorded data, unless you launch a complete re-processing of the data.
  • You must take care of DELETE as well as UPDATE and INSERT.

But I think there are some big advantages:

  • Computing time is probably about the same as with the "memory solution" (eventually), but it is spread across each database Create/Update/Delete rather than concentrated at query time.
  • Looking for a whole word, or for words "starting with", is near-instantaneous: with an index, searching the "Words" table is dichotomic (binary search), and the "WordsLinks" table query is also very fast with an index.
  • Looking for multiple words at the same time is simple: gather a group of "WordsLinks" IDs for each found word, and compute their intersection to keep only the database IDs common to all the groups. For example, with the words "tree" and "leaf", the first one could give the table records {1, 4, 6} and the second one {1, 3, 6, 9}; the intersection keeps only the common part: {1, 6}.
  • A LIKE %...% on a single-column table is probably faster than a lot of LIKE %...% statements on different fields of different tables. And each database engine handles some caching: the "Words" table could be small enough to be kept in memory.
  • I think there is only a small risk of performance or memory problems if the data becomes huge.
  • As every search is fast, you can even look for synonyms, for example searching "network" if the user didn't find anything with "ethernet".
  • You can apply rules, like splitting camel-case words to generate, for example, the three words "wood", "X", and "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.
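The multi-word intersection described above can be done directly with Set.retainAll — a small sketch using the example numbers from the "tree"/"leaf" case (the class name is mine, just for illustration):

```java
import java.util.*;

// Sketch: intersect the ID sets gathered from "WordsLinks" for each query word.
class WordLinkIntersector {
    static Set<Integer> intersect(List<Set<Integer>> idSets) {
        if (idSets.isEmpty()) return Collections.emptySet();
        // Start from the first word's IDs, then keep only IDs present in every other set.
        Set<Integer> result = new HashSet<>(idSets.get(0));
        for (Set<Integer> s : idSets.subList(1, idSets.size())) {
            result.retainAll(s);
        }
        return result;
    }
}
```

With {1, 4, 6} for "tree" and {1, 3, 6, 9} for "leaf", the intersection is {1, 6}, exactly as in the example.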

I think the solution you need could be a blend of these methods: for example, keep UPDATE, INSERT, and DELETE lightweight, and feed "Words" and "WordsLinks" from a TRIGGER.

Just as an anecdote, I saw a piece of software developed by my company in which it was decided to keep "everything" (!) in memory. It led us to recommend that our customers buy servers with 64 GB of RAM. A little bit expensive. It explains why I am very prudent when I see solutions that could, eventually, lead to filling up memory.

Elo answered Oct 14 '22