Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In ArangoDB, will querying, with filters, from the neighbor(s) be done in O(n)?

I've been reading Aql Graph Operation and Graphs, and have found no concrete example and performance explanation for the use case of SQL-Traverse.

E.g.:

If I have a collection Users, which has a company relation to collection Company

Collection Company has relation location to collection Location;

Collection Location is either a city, country, or region, and has relation city, country, region to itself.

Now, I would like to query all users who belong to companies in Germany or EU.

SELECT from Users where Users.company.location.city.country.name="Germany";
SELECT from Users where Users.company.location.city.parent.name="Germany";

or

SELECT from Users where Users.company.location.city.country.region.name="europe";
SELECT from Users where Users.company.location.city.parent.parent.name="europe";

Assuming that Location.name is indexed, can I have the two queries above executed with O(n), with n being the number of documents in Location (O(1) for graph traversal, O(n) for index scanning)?

Of course, I could just save regionName or countryName directly in company, as these cities and countries are in EU, unlike in ... other places, won't probably change, but what if... you know what I mean (kidding, what if I have other use cases which require constant update)

like image 414
TruongSinh Avatar asked Dec 10 '15 07:12

TruongSinh


1 Answers

I'm going to explain this using the ArangoDB 2.8 Traversals.

We Create these collections to match your shema using arangosh:

db._create("countries")
db.countries.save({_key:"Germany", name: "Germany"})
db.countries.save({_key:"France", name: "France"})
db.countries.ensureHashIndex("name")

db._create("cities")
db.cities.save({_key: "Munich"})
db.cities.save({_key: "Toulouse")

db._create("company")
db.company.save({_key: "Siemens"})
db.company.save({_key: "Airbus"})

db._create("employees")
db.employees.save({lname: "Kraxlhuber", cname: "Xaver", _key: "user1"})
db.employees.save({lname: "Heilmann",  cname: "Vroni",  _key: "user2"})
db.employees.save({lname: "Leroy",  cname: "Marcel",  _key: "user3"})

db._createEdgeCollection("CityInCountry")
db._createEdgeCollection("CompanyIsInCity")
db._createEdgeCollection("WorksAtCompany")


db.CityInCountry.save("cities/Munich", "countries/Germany", {label: "beautiful South near the mountains"})
db.CityInCountry.save("cities/Toulouse", "countries/France", {label: "crowded city at the mediteranian Sea"})

db.CompanyIsInCity.save("company/Siemens", "cities/Munich", {label: "darfs ebbes gscheits sein? Oder..."})
db.CompanyIsInCity.save("company/Airbus", "cities/Toulouse", {label: "Big planes Ltd."})


db.WorksAtCompany.save("employees/user1", "company/Siemens", {employeeOfMonth: true})
db.WorksAtCompany.save("employees/user2", "company/Siemens", {veryDiligent: true})
db.WorksAtCompany.save("employees/user3", "company/Eurocopter", {veryDiligent: true})

In AQL we would write this query the other way around. We start with the constant time FILTER on the indexed attribute name and start our traversals from there on. Therefor we filter for the country "Germany":

db._explain("FOR country IN countries FILTER country.name == 'Germany' RETURN country ")
Query string:
 FOR country IN countries FILTER country.name == 'Germany' RETURN country 

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  6   IndexNode          1     - FOR country IN countries   /* hash index scan */
  5   ReturnNode         1       - RETURN country

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields       Ranges
  6   hash   countries    false    false        66.67 %   [ `name` ]   country.`name` == "Germany"

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index

Now that we have our well filtered start node, we do a graph traversal in reverse direction. Since we know that Employees are exactly 3 steps away from the start Vertex, and we're not interested in the path, we only return the 3rd layer:

db._query("FOR country IN countries FILTER country.name == 'Germany' FOR v IN 3 INBOUND country CityInCountry, CompanyIsInCity, WorksAtCompany  RETURN v")

[ 
  { 
    "cname" : "Xaver", 
    "lname" : "Kraxlhuber", 
    "_id" : "employees/user1", 
    "_rev" : "1286703864570", 
    "_key" : "user1" 
  }, 
  { 
    "cname" : "Vroni", 
    "lname" : "Heilmann", 
    "_id" : "employees/user2", 
    "_rev" : "1286729095930", 
    "_key" : "user2" 
  } 
]

Some words about this queries performance:

  • We locate Germany using a hash index is constant time -> O(1)

  • Based on that we want to traverse m many paths where m is the number of employees in Germany; Each of them can be traversed in constant time. -> O(m) at this step.

  • Return the result in constant time -> O(1)

    All combined we need O(m) where we expect m to be less than n (the number of employees) as used in your SQL-Traversal.

like image 114
dothebart Avatar answered Nov 07 '22 15:11

dothebart