We have a few SolrCloud & ZooKeeper setups running in AWS EC2, and for the most part they're running smoothly, but after a recent failure of one of our ZooKeeper nodes I started wondering if any one method of having the clients address the ZooKeepers was better than others. Our clients are Java-based, using the Solr 4.1 Java client.
Originally we were using hosts-file entries to identify the ZooKeepers, but given the nature of AWS, keeping the entries in /etc/hosts up to date became very tedious. So we're now using custom DNS via Route53 to identify the ZooKeepers instead. But we're still identifying the ZooKeeper nodes individually, so as an example we currently specify this when launching our clients:
-Dsolr.zookeeperHosts='zk-1.mydomain.com:2181,zk-2.mydomain.com:2181,zk-3.mydomain.com:2181'
The hosts zk-1.mydomain.com etc. are simply CNAME'd to the DNS for each ZooKeeper EC2 instance. So now if Amazon forces us to reboot a ZooKeeper, which causes it to get a new IP address, the client will eventually get the new IP when the DNS record is updated.
My question is whether there's an even better approach to handling this. Suppose we wanted to add additional ZooKeepers into the mix, so we had a quorum of 5 nodes instead of 3. (I actually want to do this.) Would it make more sense to have a single DNS round-robin record that contains all the ZooKeepers in it and pass that single DNS name to the client?
For example, set up the DNS record zookeepers.mydomain.com as a CNAME that points to zk-1.mydomain.com, zk-2.mydomain.com and zk-3.mydomain.com and then simply pass this to my clients:
-Dsolr.zookeeperHosts='zookeepers.mydomain.com:2181'
This way, when I add new ZooKeepers to the cluster I could simply add another CNAME record to zookeepers.mydomain.com and not need to worry about updating the configs on all the clients.
Is the Solr client smart enough to make use of a DNS record with multiple records in it? Specifically, if one ZooKeeper happens to be down, and the client tries to connect to it, will the client know enough to query DNS again to get the IP of the next ZooKeeper and attempt to communicate with it?
Using CNAMEs is a good idea, but I suggest extending it with Elastic IPs to make the setup more robust: DNS changes take time to propagate, while Elastic IPs are far more responsive.
However, a word of caution. In our investigations we explored how ZooKeeper/Solr would react if, instead of hostnames or IPs, we put a load balancer in front of the ZooKeepers and gave that address to Solr. DON'T DO THIS! It seems that Solr internally treats each solr.zookeeperHosts entry as a single ZooKeeper server, and when that one entry failed for some reason it was invalidated. Since from Solr's perspective there weren't any other ZooKeeper servers, Solr went down. My guess is that you will have the same problem with a single record that holds several IPs.
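If you still want to keep a single name in DNS, one option is to expand it on the client side at startup, so that every ZooKeeper still shows up as its own entry. The following is only a minimal sketch: the class name DnsExpand and the method expand are made up for illustration, it keeps IPv4 addresses only for simplicity, and it says nothing about how the Solr client resolves names internally. Note that the IPs are captured once at startup, so they go stale if an instance later gets a new address.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Solr or ZooKeeper): expands one DNS name
// into an explicit "ip:port" entry per address the name resolves to.
final class DnsExpand {

    static List<String> expand(String name, int port) {
        List<String> entries = new ArrayList<>();
        try {
            // getAllByName returns every address the resolver knows for the name.
            for (InetAddress addr : InetAddress.getAllByName(name)) {
                // Assumption: IPv4 only, to keep the "ip:port" form unambiguous.
                if (addr instanceof Inet4Address) {
                    entries.add(addr.getHostAddress() + ":" + port);
                }
            }
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Pass the round-robin name as the first argument, e.g. zookeepers.mydomain.com
        String name = args.length > 0 ? args[0] : "localhost";
        System.out.println(String.join(",", expand(name, 2181)));
    }
}
```

Running it against the round-robin name is also a quick way to see how many addresses that name actually returns from a given client machine.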
The best solution to this is to automate as much as possible. In a previous project I used Chef to gather all the ZooKeeper nodes and set the IPs/hostnames dynamically on each Solr node. If Chef is too much of a change for you, the same can be done using EC2 tags and some clever bash scripting. You can mark your ZooKeeper instances with a tag and use the EC2 command-line tools like this to get a list of the tagged instances, from which you can pull the IPs.
 ec2-describe-instances --filter "tag-key=Zookeeper"
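Once you have the list of hosts or IPs from the tag lookup, turning it into the value for the client is simple string work. A minimal sketch, assuming the hosts arrive as a plain list and all ZooKeepers listen on the same port; the class name ZkHostString and the method build are hypothetical, and the output is just the comma-separated host:port form used in the -Dsolr.zookeeperHosts example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a "host:port,host:port,..." string from a host list.
final class ZkHostString {

    static String build(List<String> hosts, int port) {
        List<String> entries = new ArrayList<>();
        for (String host : hosts) {
            String h = host.trim();
            // Skip blank lines that scripted output often contains.
            if (!h.isEmpty()) {
                entries.add(h + ":" + port);
            }
        }
        if (entries.isEmpty()) {
            throw new IllegalArgumentException("no ZooKeeper hosts given");
        }
        return String.join(",", entries);
    }

    public static void main(String[] args) {
        // Hosts can be passed as arguments; otherwise the names from the question are used.
        List<String> hosts = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("zk-1.mydomain.com", "zk-2.mydomain.com", "zk-3.mydomain.com");
        System.out.println(build(hosts, 2181));
    }
}
```

Failing loudly on an empty list is deliberate: starting a client with an empty ZooKeeper list is harder to diagnose than a startup error.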