JGroups nodes on EC2 not talking although they see each other

I'm trying to use Hibernate Search so that all writes to the Lucene index from the jgroupsSlave nodes are sent to the jgroupsMaster node, with the Lucene index then shared back to the slaves via Infinispan. Everything works locally, but although the nodes discover each other on EC2, they don't seem to be communicating.

They are both sending each other are-you-alive messages.

# master output sample
86522 [LockBreakingService,localCache,archlinux-37498] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
86523 [LockBreakingService,LuceneIndexesLocking,archlinux-37498] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
87449 [Timer-4,luceneCluster,archlinux-37498] DEBUG org.jgroups.protocols.FD  - sending are-you-alive msg to archlinux-57950 (own address=archlinux-37498)
87522 [LockBreakingService,localCache,archlinux-37498] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
87523 [LockBreakingService,LuceneIndexesLocking,archlinux-37498] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0

# slave output sample
85499 [LockBreakingService,localCache,archlinux-57950] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
85503 [LockBreakingService,LuceneIndexesLocking,archlinux-57950] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
86190 [Timer-3,luceneCluster,archlinux-57950] DEBUG org.jgroups.protocols.FD  - sending are-you-alive msg to archlinux-37498 (own address=archlinux-57950)
86499 [LockBreakingService,localCache,archlinux-57950] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0
86503 [LockBreakingService,LuceneIndexesLocking,archlinux-57950] DEBUG org.infinispan.transaction.TransactionTable  - About to cleanup completed transaction. Initial size is 0

Security Groups

I have two jars, one for the master and one for the slave, each running on its own EC2 instance. I can ping each instance from the other, and both are in the same security group, which defines the following rules for communication between any machines in the group:

ALL ports for ICMP, 0-65535 for TCP, 0-65535 for UDP

So I don't think it is a security group configuration problem.

hibernate.properties

# there is also a corresponding jgroupsSlave
hibernate.search.default.worker.backend=jgroupsMaster
hibernate.search.default.directory_provider = infinispan
hibernate.search.infinispan.configuration_resourcename=infinispan.xml
hibernate.search.default.data_cachename=localCache
hibernate.search.default.metadata_cachename=localCache
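
For reference, the slave's hibernate.properties is identical apart from the worker backend (the cache store path difference mentioned below lives in its infinispan.xml):

# slave side
hibernate.search.default.worker.backend=jgroupsSlave
hibernate.search.default.directory_provider = infinispan
hibernate.search.infinispan.configuration_resourcename=infinispan.xml
hibernate.search.default.data_cachename=localCache
hibernate.search.default.metadata_cachename=localCache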

infinispan.xml

<infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"
            xmlns="urn:infinispan:config:5.1">
    <global>
        <transport clusterName="luceneCluster" transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
            <properties>
                <property name="configurationFile" value="jgroups-ec2.xml" />
            </properties>
        </transport>
    </global>

    <default>
        <invocationBatching enabled="true" />
        <clustering mode="repl">

        </clustering>
    </default>

    <!-- this is just so that each machine doesn't have to store the index
         in memory -->
    <namedCache name="localCache">
        <loaders passivation="false" preload="true" shared="false">
            <loader class="org.infinispan.loaders.file.FileCacheStore" fetchPersistentState="true" ignoreModifications="false" purgeOnStartup="false">
                <properties>
                    <property name="location" value="/tmp/infinspan/master" />
                    <!-- there is a corresponding /tmp/infinispan/slave in
                    the slave config -->
                </properties>
            </loader>
        </loaders>
    </namedCache>
</infinispan>

jgroups-ec2.xml

<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.2.xsd">
    <TCP
            bind_addr="${jgroups.tcp.address:127.0.0.1}"
            bind_port="${jgroups.tcp.port:7800}"
            loopback="true"
            port_range="30"
            recv_buf_size="20000000"
            send_buf_size="640000"
            max_bundle_size="64000"
            max_bundle_timeout="30"
            enable_bundling="true"
            use_send_queues="true"
            sock_conn_timeout="300"
            enable_diagnostics="false"

            bundler_type="old"

            thread_pool.enabled="true"
            thread_pool.min_threads="2"
            thread_pool.max_threads="30"
            thread_pool.keep_alive_time="60000"
            thread_pool.queue_enabled="false"
            thread_pool.queue_max_size="100"
            thread_pool.rejection_policy="Discard"

            oob_thread_pool.enabled="true"
            oob_thread_pool.min_threads="2"
            oob_thread_pool.max_threads="30"
            oob_thread_pool.keep_alive_time="60000"
            oob_thread_pool.queue_enabled="false"
            oob_thread_pool.queue_max_size="100"
            oob_thread_pool.rejection_policy="Discard"
            />
    <S3_PING secret_access_key="removed_for_stackoverflow" access_key="removed_for_stackoverflow" location="jgroups_ping" />

    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <pbcast.NAKACK2
            use_mcast_xmit="false"
            xmit_interval="1000"
            xmit_table_num_rows="100"
            xmit_table_msgs_per_row="10000"
            xmit_table_max_compaction_time="10000"
            max_msg_batch_size="100"
            become_server_queue_size="0"/>
    <UNICAST2
            max_bytes="20M"
            xmit_table_num_rows="20"
            xmit_table_msgs_per_row="10000"
            xmit_table_max_compaction_time="10000"
            max_msg_batch_size="100"/>
    <RSVP />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>
    <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>
    <UFC max_credits="2000000" min_threshold="0.10"/>
    <MFC max_credits="2000000" min_threshold="0.10"/>
    <FRAG2 frag_size="60000"/>
</config>

I copied this directly from the most recent infinispan-core distribution (5.2.0.Beta3, though I believe I also tried 5.1.4). The only thing I changed was replacing their S3_PING settings with my own, but again, I can see the nodes writing to S3 and they do find each other, so I don't think that is the problem. I'm also starting the master and slave with jgroups.tcp.address set to each instance's private IP address. I also tried a few greatly simplified configs, without any success.
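
(For clarity on how that address is supplied: the ${jgroups.tcp.address:127.0.0.1} placeholder in the TCP stack is resolved from a JVM system property, so each node gets launched with something along these lines; the IPs and jar names here are just placeholders:)

java -Djgroups.tcp.address=10.0.1.12 -Djgroups.tcp.port=7800 -jar search-master.jar
java -Djgroups.tcp.address=10.0.1.13 -Djgroups.tcp.port=7800 -jar search-slave.jar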

Any ideas as to what the problem could be? I've spent a few days playing around with it, and it is driving me crazy. I think it must be something in the JGroups config, since it works locally but just isn't able to communicate on EC2.

Is there any other information I can provide to help figure this out?

asked Nov 09 '12 by dustincg

1 Answer

You have two JGroups channels being started, so there are two JGroups configurations to specify: one for Infinispan and one for the backend worker communication.

Both Infinispan and the jgroupsMaster backend will use their default JGroups configuration unless you specify one, and the defaults use multicast, which doesn't work on EC2.

It looks like you have a correct configuration for the Infinispan index, but you also have to reconfigure the jgroupsMaster worker to use S3_PING or JDBC_PING; it likely works for you locally because the default configuration is able to autodiscover peers using multicast.
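
As a rough sketch of what that reconfiguration could look like in a Hibernate Search 4.x-era setup: the worker's JGroups channel can be pointed at its own configuration file through a property. Treat the property name below as an assumption to verify against the documentation for your exact version; the idea is simply to reuse the same EC2-friendly stack you already feed to Infinispan:

# assumed property name for Hibernate Search 4.x -- verify against your version's docs
hibernate.search.services.jgroups.configurationFile = jgroups-ec2.xml

With something like that in place, both channels would discover peers over S3_PING instead of relying on multicast.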

This duplication will be resolved by HSEARCH-882; I'm looking forward to it, as it should significantly simplify configuration.

answered by Sanne