I am doing some work for an organisation that has offices in 48 countries. Essentially, the way they work now is that each office stores data in a local copy of the database, which is replicated out to all the other regions/offices in the world. On the odd occasion when they need to work directly on something whose "development copy" is on the London servers, they have to connect directly to the London servers, regardless of where they are in the world.
So let's say I want a single graph spanning the whole organisation, sharded so that each region gets relatively fast reads of the graph. I am worried that writes are going to kill performance. I understand that writes go through a single master; does that mean there is a single master globally? i.e. if that master happens to be in London, does every write to the database from Sydney have to traverse that distance regardless of the local sharding? And what would happen if Sydney and London were cut off from each other (for whatever reason)?
Essentially, how does Neo4j solve the global distribution problem?
With Neo4j 4.0, the customer decides how to split up, or shard, their data into multiple databases, or sub-graphs, that can run on separate systems but be queried as a single entity through the new Fabric server.
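As a sketch of what a Fabric query looks like: assuming a Fabric database whose configured remote graphs are named `fabric.emea` and `fabric.apac` (shard names are hypothetical, set up via `fabric.graph.<n>.name` entries in `neo4j.conf`), a single query can address several shards and combine the results:

```cypher
// Query two regional shards as one logical graph (shard names are illustrative)
USE fabric.emea
MATCH (c:Customer)
RETURN c.name AS name
  UNION
USE fabric.apac
MATCH (c:Customer)
RETURN c.name AS name
```

The `USE` clause selects which shard each part of the query runs against; the Fabric server fans the work out and merges the results.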
Multi-database: this capability allows you to operate and manage multiple databases within a single installation of the Neo4j DBMS. The data can be segmented by use case, sensitivity, or business application into different databases.
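For example, in Neo4j 4.x Enterprise Edition, databases are created and listed with administration commands run against the `system` database (database names here are illustrative):

```cypher
// Run against the system database (Neo4j 4.x Enterprise Edition)
CREATE DATABASE sales;
CREATE DATABASE hr;
SHOW DATABASES;
```

Each database is an isolated graph within the same DBMS installation; a session or query then targets one of them by name.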
With sharding, Neo4j can scale horizontally for mission-critical applications: queries that fan out across shards in parallel can cut response times from minutes to milliseconds compared with scanning one monolithic store.
Neo4j is a graph database. Instead of rows and columns, a graph database has nodes, edges, and properties, and it represents relationships directly. For many big data and analytics use cases it is a better fit than row-and-column databases or free-form JSON document databases.
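To make the model concrete, here is a minimal Cypher sketch (the labels and property names are made up for illustration): nodes carry labels and properties, and relationships between them are typed and directed.

```cypher
// Two labelled nodes connected by a typed, directed relationship
CREATE (a:Person {name: 'Alice'})-[:WORKS_IN]->(o:Office {city: 'Sydney'})
```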
The distribution mechanism in Neo4j Enterprise Edition is indeed master-slave style. A write request to the master is committed locally and synchronously pushed to the number of slaves defined by push_factor (default: 1). A write request sent to a slave is synchronously applied to the master, to the slave itself, and to enough other machines to satisfy push_factor. The synchronous slave-to-master communication can hurt performance, which is why it's recommended to redirect writes to the master and distribute reads over the slaves. The cluster communication works fine on high-latency networks.
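As a configuration sketch for the classic HA mode described above (key names as I recall them from the Neo4j HA settings, so verify against your version's documentation):

```
# neo4j.properties (Neo4j HA, pre-4.0)
ha.server_id=1
# Synchronously replicate each commit to this many slaves (default: 1)
ha.tx_push_factor=2
# On read-only regional instances, prevent them from ever becoming master
ha.slave_only=true
```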
In a multi-region setup I'd recommend a full cluster (i.e. a minimum of 3 instances) in the 'primary region'. Another 3-instance cluster runs in a secondary region in slave-only mode. In case the primary region goes down completely (it happens very rarely, but it does happen), the monitoring tool triggers a config change in the secondary region that enables its instances to become master. All other offices requiring fast read access then have x slave-only instances (x >= 1, depending on read load). In each location you have an HAProxy (or other load balancer) that directs writes to the master (normally in the primary region) and reads to the local region.
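A sketch of that HAProxy routing (hostnames and ports are hypothetical): Neo4j HA instances expose management endpoints such as `/db/manage/server/ha/master`, which returns success only on the current master, so HTTP health checks can steer writes to the master and reads to locally available instances.

```
# HAProxy fragment for one regional office (e.g. Sydney); hosts/ports illustrative
frontend neo4j-write
    bind *:7480
    default_backend masters

backend masters
    # Only the current master passes this check
    option httpchk GET /db/manage/server/ha/master
    server ldn1 london1:7474 check
    server ldn2 london2:7474 check
    server ldn3 london3:7474 check

frontend neo4j-read
    bind *:7481
    default_backend local-slaves

backend local-slaves
    # Any available cluster member passes this check
    option httpchk GET /db/manage/server/ha/available
    server syd1 sydney1:7474 check
    server syd2 sydney2:7474 check
```

Applications in each office then send writes to port 7480 (routed to the global master) and reads to port 7481 (served locally).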
If you want to go beyond ~20 instances in a single cluster, consider doing a serious proof of concept first: due to the master-slave architecture, this approach does not scale indefinitely.