I've got a model that looks approximately like that (in JSON):
{"gender": "female",
"name": [
{"family": "Smith",
"given": ["Samantha"],
"middle": ["Lee"]]}}
There are about 6M records with such a structure. I need to provide a full-text search across all the components of a person's name using OR clause. E.g. if a user inputs "smith", I need to check all the given, middle and family names.
In Datomic, I've made a schema:
{:db/ident :model/name
:db/valueType :db.type/ref
:db/isComponent true
:db/cardinality :db.cardinality/many}
{:db/ident :model.name/family
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/fulltext true}
{:db/ident :model.name/given
:db/valueType :db.type/string
:db/cardinality :db.cardinality/many
:db/fulltext true}
{:db/ident :model.name/middle
:db/valueType :db.type/string
:db/cardinality :db.cardinality/many
:db/fulltext true}
Pay attention, I provided full-text index for these attributes. Now, when I query by a single attribute, say, family
, the performance is great (about 100ms):
(def query-all
'[:find [(rand 100 ?model) ...]
:in $ ?search
:where
[(fulltext $ :model.name/family ?search) [[?name _ _ _]]]
[?model :model/name ?name]])
But when I add other conditions using OR clause, the performance falls down dramatically (20 seconds):
(def query-all
'[:find [(rand 100 ?model) ...]
:in $ ?search
:where
(or
[(fulltext $ :model.name/family ?search) [[?name _ _ _]]]
[(fulltext $ :model.name/given ?search) [[?name _ _ _]]]
[(fulltext $ :model.name/middle ?search) [[?name _ _ _]]])
[?model :model/name ?name]])
My question is, how could I improve that?
And if we go further, it would be great to also find not by a name only, but by an address's components too. Ideally, there would be the following query (that also works quite slow):
(def query-all
'[:find [(rand 100 ?model) ...]
:in $ ?search
:where
(or
(and
[(fulltext $ :model.name/given ?search) [[?e _ _ _]]]
[?p :model/name ?e])
(and
[(fulltext $ :model.name/middle ?search) [[?e _ _ _]]]
[?p :model/name ?e])
(and
[(fulltext $ :model.name/prefix ?search) [[?e _ _ _]]]
[?p :model/name ?e])
(and
[(fulltext $ :model.name/suffix ?search) [[?e _ _ _]]]
[?p :model/name ?e])
(and
[(fulltext $ :model.name/family ?search) [[?e _ _ _]]]
[?p :model/name ?e])
(and
[(fulltext $ :model.address/city ?search) [[?e _ _ _]]]
[?p :model/address ?e])
(and
[(fulltext $ :model.address/state ?search) [[?e _ _ _]]]
[?p :model/address ?e]))])
How would I implement this?
We were in the same situation and ended up using a work around:
We created an attribute that concats all of the other string attributes. Querying using fulltext over that attribute, of course.
I think you should not use or
, but do four different queries to the db
and concatenate the results from those queries. The datomic rules functionality is useful, but ruled queries they tend to explode in terms of realized results.
Remember that at a db
-pointer is immutable and will give consistent results for several queries asked to it. This may not be true for fulltext searches always, since the Lucene indexing job runs after the transactions, but for most applications this probably wont matter that much.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With