I've been reading Aql Graph Operation and Graphs, and have found no concrete example and performance explanation for the use case of SQL-Traverse.
E.g.:
If I have a collection Users, which has a company relation to collection Company
Collection Company has relation location to collection Location;
Collection Location is either a city, country, or region, and has relation city, country, region to itself.
Now, I would like to query all users who belong to companies in Germany or EU.
SELECT from Users where Users.company.location.city.country.name="Germany";
SELECT from Users where Users.company.location.city.parent.name="Germany";
or
SELECT from Users where Users.company.location.city.country.region.name="europe";
SELECT from Users where Users.company.location.city.parent.parent.name="europe";
Assuming that Location.name is indexed, can I have the two queries above executed with O(n), with n being the number of documents in Location (O(1) for graph traversal, O(n) for index scanning)?
Of course, I could just save regionName or countryName directly in company, as these cities and countries are in EU, unlike in ... other places, won't probably change, but what if... you know what I mean (kidding, what if I have other use cases which require constant update)
I'm going to explain this using the ArangoDB 2.8 Traversals.
We Create these collections to match your shema using arangosh:
db._create("countries")
db.countries.save({_key:"Germany", name: "Germany"})
db.countries.save({_key:"France", name: "France"})
db.countries.ensureHashIndex("name")
db._create("cities")
db.cities.save({_key: "Munich"})
db.cities.save({_key: "Toulouse")
db._create("company")
db.company.save({_key: "Siemens"})
db.company.save({_key: "Airbus"})
db._create("employees")
db.employees.save({lname: "Kraxlhuber", cname: "Xaver", _key: "user1"})
db.employees.save({lname: "Heilmann", cname: "Vroni", _key: "user2"})
db.employees.save({lname: "Leroy", cname: "Marcel", _key: "user3"})
db._createEdgeCollection("CityInCountry")
db._createEdgeCollection("CompanyIsInCity")
db._createEdgeCollection("WorksAtCompany")
db.CityInCountry.save("cities/Munich", "countries/Germany", {label: "beautiful South near the mountains"})
db.CityInCountry.save("cities/Toulouse", "countries/France", {label: "crowded city at the mediteranian Sea"})
db.CompanyIsInCity.save("company/Siemens", "cities/Munich", {label: "darfs ebbes gscheits sein? Oder..."})
db.CompanyIsInCity.save("company/Airbus", "cities/Toulouse", {label: "Big planes Ltd."})
db.WorksAtCompany.save("employees/user1", "company/Siemens", {employeeOfMonth: true})
db.WorksAtCompany.save("employees/user2", "company/Siemens", {veryDiligent: true})
db.WorksAtCompany.save("employees/user3", "company/Eurocopter", {veryDiligent: true})
In AQL we would write this query the other way around.
We start with the constant time FILTER
on the indexed attribute name
and start our traversals from there on.
Therefor we filter for the country "Germany":
db._explain("FOR country IN countries FILTER country.name == 'Germany' RETURN country ")
Query string:
FOR country IN countries FILTER country.name == 'Germany' RETURN country
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 1 - FOR country IN countries /* hash index scan */
5 ReturnNode 1 - RETURN country
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash countries false false 66.67 % [ `name` ] country.`name` == "Germany"
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
Now that we have our well filtered start node, we do a graph traversal in reverse direction. Since we know that Employees
are exactly 3 steps away from the start Vertex, and we're not interested in the path, we only return the 3rd layer:
db._query("FOR country IN countries FILTER country.name == 'Germany' FOR v IN 3 INBOUND country CityInCountry, CompanyIsInCity, WorksAtCompany RETURN v")
[
{
"cname" : "Xaver",
"lname" : "Kraxlhuber",
"_id" : "employees/user1",
"_rev" : "1286703864570",
"_key" : "user1"
},
{
"cname" : "Vroni",
"lname" : "Heilmann",
"_id" : "employees/user2",
"_rev" : "1286729095930",
"_key" : "user2"
}
]
Some words about this queries performance:
We locate Germany using a hash index is constant time -> O(1)
Based on that we want to traverse m many paths where m is the number of employees in Germany; Each of them can be traversed in constant time. -> O(m) at this step.
Return the result in constant time -> O(1)
All combined we need O(m) where we expect m to be less than n (the number of employees) as used in your SQL-Traversal.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With