I have a hypothetical situation that I'd like to solve, but I can't find the ideal answer. Suppose a query could return a huge data set: how do you paginate it so that the impact on memory is minimal? The datoms API, iterating over the datoms and filtering them one by one? The index-range API, where I would have to do the same thing as with the datoms API, iterating over the items and filtering them one by one? An initial query that returns only ids, then paginating those ids so they can be used in another query to retrieve the full data set?
In SQL you can usually define pagination in the query itself:

SELECT col1, col2, ...
FROM ...
WHERE ...
ORDER BY col1               -- ORDER BY is mandatory when using OFFSET/FETCH
-- the paging comes here
OFFSET 10 ROWS              -- skip 10 rows
FETCH NEXT 10 ROWS ONLY;    -- take 10 rows
There are many things to consider.
First, at the time of writing, the Datalog implementation that ships with Datomic is eager, and does not spill to disk, which means the result set of a Datalog query must fit in memory.
This does not mean that Datalog is incompatible with a large result, because you can have each Datalog query deal with only a small part of the data. For instance, you can use Datalog to compute the 'logical' part of the query (what entities to return), and the Entity API or the Pull API to (lazily) compute the 'content' part of the query (what attributes to return for each entity). Given that an Entity Id is just a Java Long (8 bytes), this can save you one or two orders of magnitude of memory footprint. Example using the Entity API:
(defn export-customers
  [db search-criteria]
  (->>
    ;; logical part - Datalog-based, eager
    (d/q '[:find [?customer ...]
           :in % $ ?search-criteria
           :where (customer-matches-criteria ?search-criteria ?customer)]
         (my-rules) db search-criteria)
    ;; content part - Entity API based, lazy
    (map (fn [eid]
           (let [customer (d/entity db eid)]
             (select-keys customer
                          [:customer/id
                           :customer/email
                           :customer/firstName
                           :customer/lastName
                           :customer/subscription-time]))))))
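If you then need SQL-style OFFSET/FETCH semantics, one option is to apply drop/take to the (cheap) entity ids before touching the Entity API, so that skipped pages never load any attributes. A minimal sketch reusing the rule from above (page and page-size are hypothetical parameters, page being zero-based):

(defn customers-page
  [db search-criteria page page-size]
  (->> (d/q '[:find [?customer ...]
              :in % $ ?search-criteria
              :where (customer-matches-criteria ?search-criteria ?customer)]
            (my-rules) db search-criteria)
       (sort)                    ;; stable order across calls, like SQL's ORDER BY
       (drop (* page page-size)) ;; skip previous pages of entity ids (8 bytes each)
       (take page-size)          ;; keep only this page
       (map (fn [eid]
              (select-keys (d/entity db eid)
                           [:customer/id :customer/email])))))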
You can complement this approach by eagerly storing the whole result in a secondary blob store, and then paginating against that store.
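A minimal sketch of that idea, using an in-memory atom as a stand-in for whatever blob store you actually use (S3, a file, etc.); all names here are hypothetical:

(def result-cache
  ;; stand-in for a secondary blob store, keyed by query id
  (atom {}))

(defn cache-result!
  ;; eagerly runs the export once and stores the full result under query-id
  [query-id db search-criteria]
  (swap! result-cache assoc query-id (vec (export-customers db search-criteria))))

(defn cached-page
  ;; serves one page from the cached result without touching Datomic again
  [query-id page page-size]
  (->> (get @result-cache query-id)
       (drop (* page page-size))
       (take page-size)))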
If your query logic is not too complex, you could also imagine not using Datalog at all, e.g. by using raw index access (e.g. via the Datoms API or the Index Range API) in a lazy way.
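For instance, with the Peer API, d/datoms walks an index lazily. The sketch below pages customers ordered by :customer/subscription-time via the AVET index, assuming that attribute is indexed (:db/index true):

(defn customers-by-subscription-time
  ;; lazily walks the AVET index: datoms come back sorted by value,
  ;; so the order is stable without an explicit sort
  [db page page-size]
  (->> (d/datoms db :avet :customer/subscription-time)
       (map :e)                  ;; keep only the entity ids
       (drop (* page page-size))
       (take page-size)
       (map #(d/entity db %))))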
Finally, you should consider that maybe Datomic is not the right fit for servicing your analytical queries. Because change detection is trivial with Datomic, it's fairly easy to stream derived data to secondary stores that are better equipped to compute analytical queries (e.g. Elasticsearch, Google BigQuery, PostgreSQL, etc.).
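With the Peer API, the transaction report queue hands you every new transaction as it commits, which is the hook needed for such streaming. A minimal sketch, where index-in-secondary-store! stands for a hypothetical function that writes to your secondary store:

(defn stream-changes!
  ;; blocks on the tx-report-queue and forwards each transaction's
  ;; datoms to a secondary store; index-in-secondary-store! is hypothetical
  [conn index-in-secondary-store!]
  (let [queue (d/tx-report-queue conn)]
    (loop []
      (let [{:keys [db-after tx-data]} (.take queue)]
        (index-in-secondary-store! db-after tx-data))
      (recur))))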