Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hybrid of fulltext and property-based search engine

Background:

SQL database representing different types of events (concerts, football matches, charity collections etc.), where each contain event-related data (concert - artist name, match - host/visitor team). All of these events inherit from one, general table event, which contains data related to all of them (name, description, location, start/end date). Inheritance is implemented using table-per-subclass model known from Hibernate or Doctrine. The database also stores tables artists (id, name, birth_date) and football_teams (id, name, country, coach_name) used in event_concerts and event_football_matches tables (through FKs).

Problem:

Create a search engine that given some criteria ({name: "manchester", startDate: "01.01.2012 - 01.02.2012"} or {location: "london", description: "artists +metallica -bieber"}) will return all events that meet the criteria, as well as results from artists/football_teams tables.

Some properties of those events contain large pieces of text, that should be searched through in fulltext-search manner.

Example:

Given following search criteria:

{ location: "london", startDate: "05.11.2012 - 07.11.2012" }

Search engine should return:

  1. (football event) Arsenal vs Manchester United match, Emirates Stadium, London, 06.11.2012
  2. (concert event) Metallica concert, Some-Fancy-Location, 05.11.2012
  3. (football team/not an event) Arsenal, founded: 1886, league: Premier League
  4. (football team/not an event) Chelsea, founded: 1905, league: Premier League
  5. (festival event) Halloween in London, 07.11.2012
  6. (dance event) Sleeping Beauty at Sadler's Wells, £45, 07.11.2012
  7. (musician, not an event) Neil Christian, 1943 - 2012, Rock'n'Roll vocalist

As you can see, startDate (event-related property) is considered only in case of events.


Search engine has to scan lots of tables, that's why I believe I should use dedicated software (Sphinx, Lucene, ...?) and create separate index just for the searching.


Could anyone suggest some solution for building such an index? What software could I use as a base for that search engine?


EDIT:

Just to clarify: none of the properties is required. Some of them contain dates which will be searched using exact-match, some of them contain short text (like a location) that also will be searched using exact-match. But some of them contain long pieces of text, and that needs to be searched in full-text manner.

like image 564
Crozin Avatar asked Oct 14 '12 11:10

Crozin


4 Answers

I see three approaches.

  • Migrate to Couchdb. To make the location + start time search, you build a view using [location, event_start_time] as key. During search time, you query by using ?startkey=["london,"05.11.2012"]&endkey=["london", "07.11.2012"]. To enable full text search on description, you build a specific view like this.

  • Build ElasticSearch index. The search in your example is just a structured query using two data fields, in which location is a must, and event start time is a range. Full-text-search in elasticsearch is more powerful, you can define specific analyzer to process text in your data. Elasticsearch also supports search with geospatial data.

  • Use Amazon CloudSearch. There are some known limits. I list some I did not like:

    • Document size is less than 1MB.
    • Up to 100 values can be specified in a field.
    • Not open-sourced.

For some comparisons between search implementations:

  • Solr vs. ElasticSearch
  • https://stackoverflow.com/questions/11115523/how-does-amazon-cloudsearch-compares-to-elasticsearch-solr-or-sphinx-in-terms-o
  • http://www.searchenginecaffe.com/2012/04/amazon-cloudsearch-elastic-search-as.html
like image 50
greeness Avatar answered Oct 04 '22 08:10

greeness


You can use a search table with a tsearch2 column that will allow you to do the fulltext search and every column that you need to do a query against : name and location for instance.

If your tables inherit all from the same model it will be easier to reference them in the search table. If not, you can build a content type table to put in place generic foreign keys which will be used in your search table to reference the result rows which will allow eventually to retrieve the results of the query. The table is populated using triggers. This might be the most efficient solution especially if you know SQL (maybe PL/SQL) well.

If it's not the case, I recommend you use Lucene to build the index you need in Java, and query it as you wish.

like image 23
amirouche Avatar answered Oct 04 '22 10:10

amirouche


I'm assuming you are not using a databse with native full text implementation so this is a poor man's solution but it's something to get you going and delivery something before you go to a full blown solution such as Lucene or any other search services suggested that might need a longer learning time.

You could simply have a scheduled process running from time to time to get all the data from all these different tables and dump them into an "index" table where you could perform the search without all the joins.

You could have a table such as:

eventId - int
keywords - nvarchar(max)
location - nvarchar
start_date - datetime
end_date - datetime

This way you would have everything you need to perform the search in one place. For the search itself you could probably get a decent result with like operator in a query.

Not the best search platform a man can dream of but something that you can get working in a few hours.

like image 23
tucaz Avatar answered Oct 04 '22 08:10

tucaz


You have a lot of different requirements, and no software will be able to fulfill all of them.

  1. Custom search syntax
  2. Multiple entities search
  3. Fulltext search
  4. Semantic search (?)

Let us talk about every one of them.

Custom search syntax

You need a system that manages to construct a well-formatted query based on user input. This system manages not only to do +metallica -bieber -> text CONTAIN metallica AND text NOT CONTAIN bieber

Your example is not entirely accurate. How would the system manage to find "Arsenal", based on 'location: London' ? How does the system manage to find "Neil Christian" based on 'location: London' ?

Solution: I am pretty sure systems like Lucene support rich syntax on search queries. However, it will be the framework syntax, not your own. Please note that specifying your own syntax is difficult; you will need to construct a lexer, parser and your own translation into the query tree of your search framework.

Multiple entities search

You need to search multiple entities at once. This is no problem for search frameworks.

However, you also want to define the search structure and link this back to your entities. This will be a little bit harder, but not impossible to do.

Fulltext search

This is also pretty simple to do.

Semantic search

Advanced search frameworks should have support for semantic search (e.g. London is related to Queen, since they occur frequently together in some texts). This may be dependent on your training dataset though.

Conclusion

The thing which is hard to do is not the inheritance, nor is it the full text search. You should however really define how the data is matched. Search systems are not magic hocus pocus, although Google may make it look that way. Search depends on real mathematics; you should specify these mathematical rules on your syntax before going any further on this.

like image 28
parasietje Avatar answered Oct 04 '22 08:10

parasietje