Part of my graph is constructed using a giant join between two large collections, and I run it every time I add documents to either collection. The query is based on an older post.
FOR fromItem IN fromCollection
FOR toItem IN toCollection
FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection
This takes about 55,000 seconds to complete for my dataset. I would absolutely welcome suggestions for making that faster.
But I have two related issues:
upsert
would be fine, but in this case, since I have no way of knowing the key up front, it wouldn't help me. To get the key up front, I would need to query by example to find the key of the otherwise identical, existing edge. That seems reasonable as long as it doesn't kill my performance, but I don't know how in AQL to construct my query conditionally so that it inserts an edge if the equivalent edge does not exist yet, but does nothing if the equivalent edge does exist. How can I do this?If you run your query as written without any indexes, then it will have to do two nested full collection scans, as can be seen by looking at the output of
db._explain(<your query here>);
which shows something like:
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 3 - FOR fromItem IN fromCollection /* full collection scan */
3 EnumerateCollectionNode 9 - FOR toItem IN toCollection /* full collection scan */
4 CalculationNode 9 - LET #3 = (fromItem.`fromAttributeValue` == toItem.`toAttributeValue`) /* simple expression */ /* collections used: fromItem : fromCollection, toItem : toCollection */
5 FilterNode 9 - FILTER #3
...
If you do
db.toCollection.ensureIndex({"type":"hash", fields ["toAttributeValue"], unique:false})`
Then there will be a single full table collection scan in fromCollection
and for each item found there is a hash lookup in the toCollection
, which will be much faster. Everything will happen in batches, so this should already improve the situation. The db._explain()
will show this:
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 3 - FOR fromItem IN fromCollection /* full collection scan */
8 IndexNode 3 - FOR toItem IN toCollection /* hash index scan */
To only work on recently inserted items in fromCollection
is relatively easy: Simply add a timestamp of the import time to all vertices, and use:
FOR fromItem IN fromCollection
FILTER fromItem.timeStamp > @lastRun
FOR toItem IN toCollection
FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection
and of course put a skiplist index on the timeStamp
attribute in fromCollection
.
This should work beautifully to discover new vertices in the fromCollection
. It will "overlook" new vertices in the toCollection
that are linked to old vertices in fromCollection
.
You can discover these by interchanging the roles of the fromCollection
and the toCollection
in your query (do not forget the index on fromAttributeValue
in fromCollection
) and remembering to only put in edges if the from vertex is old, like in:
FOR toItem IN toCollection
FILTER toItem.timeStamp > @lastRun
FOR fromItem IN fromCollection
FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
FILTER fromItem.timeStamp <= @lastRun
INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection
These two together should do what you want. Please find the fully worked example here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With