Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Full-text search across several attributes in Datomic

I've got a model that looks approximately like that (in JSON):

{"gender": "female", 
 "name": [
  {"family": "Smith", 
   "given": ["Samantha"], 
   "middle": ["Lee"]]}}

There are about 6M records with such a structure. I need to provide a full-text search across all the components of a person's name using OR clause. E.g. if a user inputs "smith", I need to check all the given, middle and family names.

In Datomic, I've made a schema:

   {:db/ident       :model/name
    :db/valueType   :db.type/ref
    :db/isComponent true
    :db/cardinality :db.cardinality/many}

   {:db/ident       :model.name/family
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/fulltext    true}

   {:db/ident       :model.name/given
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/many
    :db/fulltext    true}

   {:db/ident       :model.name/middle
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/many
    :db/fulltext    true}          

Pay attention, I provided full-text index for these attributes. Now, when I query by a single attribute, say, family, the performance is great (about 100ms):

(def query-all
  '[:find [(rand 100 ?model) ...]
    :in $ ?search
    :where
    [(fulltext $ :model.name/family ?search) [[?name _ _ _]]]
    [?model :model/name ?name]])

But when I add other conditions using OR clause, the performance falls down dramatically (20 seconds):

(def query-all
  '[:find [(rand 100 ?model) ...]
    :in $ ?search
    :where
    (or
     [(fulltext $ :model.name/family ?search) [[?name _ _ _]]]
     [(fulltext $ :model.name/given ?search) [[?name _ _ _]]]
     [(fulltext $ :model.name/middle ?search) [[?name _ _ _]]])
    [?model :model/name ?name]])

My question is, how could I improve that?

And if we go further, it would be great to also find not by a name only, but by an address's components too. Ideally, there would be the following query (that also works quite slow):

(def query-all
  '[:find [(rand 100 ?model) ...]
    :in $ ?search
    :where
    (or

     (and
      [(fulltext $ :model.name/given ?search) [[?e _ _ _]]]
      [?p :model/name ?e])

     (and
      [(fulltext $ :model.name/middle ?search) [[?e _ _ _]]]
      [?p :model/name ?e])

     (and
      [(fulltext $ :model.name/prefix ?search) [[?e _ _ _]]]
      [?p :model/name ?e])

     (and
      [(fulltext $ :model.name/suffix ?search) [[?e _ _ _]]]
      [?p :model/name ?e])

     (and
      [(fulltext $ :model.name/family ?search) [[?e _ _ _]]]
      [?p :model/name ?e])

     (and
      [(fulltext $ :model.address/city ?search) [[?e _ _ _]]]
      [?p :model/address ?e])

     (and
      [(fulltext $ :model.address/state ?search) [[?e _ _ _]]]
      [?p :model/address ?e]))])

How would I implement this?

like image 581
Ivan Grishaev Avatar asked Mar 04 '18 20:03

Ivan Grishaev


2 Answers

We were in the same situation and ended up using a work around:

We created an attribute that concats all of the other string attributes. Querying using fulltext over that attribute, of course.

like image 97
axs Avatar answered Oct 08 '22 07:10

axs


I think you should not use or, but do four different queries to the db and concatenate the results from those queries. The datomic rules functionality is useful, but ruled queries they tend to explode in terms of realized results.

Remember that at a db-pointer is immutable and will give consistent results for several queries asked to it. This may not be true for fulltext searches always, since the Lucene indexing job runs after the transactions, but for most applications this probably wont matter that much.

like image 23
claj Avatar answered Oct 08 '22 08:10

claj