
MongoDB - how to handle write failures during primary re-election with Spring?

I configured my MongoDB replica set with Spring, and I'm trying to test the auto-failover. I know that if the primary goes down, it takes a few seconds for a new primary to be elected, so in that time period, all writes will fail.

I have a test application that writes to the database once per second. When I take down the primary, I get a java.io.IOException (because there is no primary to write to). If I restart my application, the writes go to the new primary without a problem.

I thought that the MongoDB Java driver could handle those cases using retries (was I wrong?), but I was unable to configure Spring to do that, so I'd appreciate some help. :)

My configuration is like so:

<mongo:mongo id="mongo" replica-set="host1:27017,host2:27017,host3:27017">
    <mongo:options
        connections-per-host="8"
        threads-allowed-to-block-for-connection-multiplier="4"
        connect-timeout="1000"
        max-wait-time="1500"
        auto-connect-retry="true"
        socket-keep-alive="true"
        socket-timeout="1500"
        slave-ok="true"
        write-number="1"
        write-timeout="0"
        write-fsync="true"/>
</mongo:mongo>

<mongo:repositories base-package="my.repositories" />

<mongo:db-factory dbname="my_db" mongo-ref="mongo" />

<bean id="mongoTemplate" class="org.springframework.data.mongodb.core.MongoTemplate">
    <constructor-arg name="mongoDbFactory" ref="mongoDbFactory" />
</bean>

Thanks!

Asked by Ayelet, Dec 01 '13 13:12


1 Answer

Here is an initial stab at a custom Spring AOP / Spring Retry RetryPolicy for generic retry in various circumstances. It is quite brittle, because it matches on exception messages, which are subject to change. I would recommend robust testing, and definitely repeating the tests whenever the MongoDB and/or Java driver version changes.

First, the Maven dependencies used:

<dependencies>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>2.11.3</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-mongodb</artifactId>
        <version>1.3.2.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjweaver</artifactId>
        <version>1.6.2</version>
    </dependency>
    <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjrt</artifactId>
        <version>1.6.2</version>
    </dependency>
    <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjtools</artifactId>
        <version>1.6.2</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.retry</groupId>
        <artifactId>spring-retry</artifactId>
        <version>1.0.3.RELEASE</version>
    </dependency>

</dependencies>

Second, a custom org.springframework.retry.RetryPolicy:

import org.springframework.retry.RetryContext;
import org.springframework.retry.policy.SimpleRetryPolicy;

import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

public class CustomMongoDBRetryPolicy extends SimpleRetryPolicy {
    private static final Logger logger = Logger.getLogger(CustomMongoDBRetryPolicy.class.getName());
    public CustomMongoDBRetryPolicy(int maxAttempts) {
        super(maxAttempts, createRetryableExceptions(), true);
    }

    private static Map<Class<? extends Throwable>, Boolean> createRetryableExceptions() {
        HashMap<Class<? extends Throwable>, Boolean> classBooleanHashMap = new HashMap<Class<? extends Throwable>, Boolean>();
        classBooleanHashMap.put(org.springframework.dao.DataAccessResourceFailureException.class, true);
        classBooleanHashMap.put(org.springframework.data.mongodb.UncategorizedMongoDbException.class, true);
        classBooleanHashMap.put(com.mongodb.MongoException.class, true);
        classBooleanHashMap.put(java.net.ConnectException.class, true);
        return classBooleanHashMap;
    }

    @Override
    public boolean canRetry(RetryContext context) {
        boolean retry = super.canRetry(context);
        if (retry) {
            @SuppressWarnings("ThrowableResultOfMethodCallIgnored")
            Throwable lastThrowable = context.getLastThrowable();
            if (lastThrowable != null) {
                String message = lastThrowable.getMessage();
                Throwable cause = lastThrowable.getCause();
                if (message != null) {
                    if (message.startsWith("No replica set members available in")) {
                        logger.info("Retrying because no replica set members available. "+message);
                        return true;
                    }
                    if (message.startsWith("not talking to master and retries used up")) {
                        logger.info("Retrying because no master. "+message);
                        return true;
                    }
                    if (message.startsWith("can't find a master")) {
                        logger.info("Retrying because no master. "+message);
                        return true;
                    }
                    if (message.matches("Read operation to server [^\\s]* failed on database .*")) {
                        logger.info("Retrying because read operation failed. "+message);
                        return true;
                    }
                }
                if (cause != null) {
                    String causeMessage = cause.getMessage();
                    if (causeMessage != null) {
                        if (causeMessage.startsWith("Connection refused")) {
                            logger.info("Retrying because connection not available. "+causeMessage+"("+message+")");
                            return true;
                        }
                    }
                }
                logger.info("Not retrying. "+message+" "+lastThrowable.getClass().getName());
                return false;
            }
        }
        return retry;
    }
}
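
Because the policy above keys off message text, it may help to pull the string checks into a pure function that can be unit-tested on its own whenever the driver version changes. The following is only a sketch in plain Java, with no Spring or MongoDB dependencies. The class name RetryableMessages and the method isRetryable are hypothetical, and the message strings are copied from the policy above rather than verified against any particular driver version.

```java
import java.util.regex.Pattern;

// Hypothetical helper: isolates the message matching used by the retry policy
// so that it can be tested without a running replica set.
final class RetryableMessages {
    // Prefixes copied from the policy above; they are assumptions about driver wording.
    private static final String[] RETRYABLE_PREFIXES = {
        "No replica set members available in",
        "not talking to master and retries used up",
        "can't find a master"
    };

    private static final Pattern READ_FAILED =
        Pattern.compile("Read operation to server [^\\s]* failed on database .*");

    private RetryableMessages() {}

    // Returns true when either the exception message or its cause's message
    // looks like one of the transient failover conditions listed above.
    static boolean isRetryable(String message, String causeMessage) {
        if (message != null) {
            for (String prefix : RETRYABLE_PREFIXES) {
                if (message.startsWith(prefix)) {
                    return true;
                }
            }
            if (READ_FAILED.matcher(message).matches()) {
                return true;
            }
        }
        return causeMessage != null && causeMessage.startsWith("Connection refused");
    }

    public static void main(String[] args) {
        System.out.println(isRetryable("can't find a master", null));
        System.out.println(isRetryable("E11000 duplicate key error", null));
    }
}
```

A canRetry override could then delegate to such a helper, which keeps the fragile strings in one place.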

Finally, tie it into the DAO using Spring AOP:

<aop:config proxy-target-class="false">
    <aop:pointcut id="retry"
                  expression="execution(* IMyDao.count(..))" />
    <aop:pointcut id="retry2"
                  expression="execution(* IMyDao.insert(..))" />
    <aop:advisor pointcut-ref="retry"
                 advice-ref="retryAdvice" order="-1"/>
    <aop:advisor pointcut-ref="retry2"
                 advice-ref="retryAdvice" order="-1"/>
</aop:config>

The following combines org.springframework.retry.backoff.ExponentialBackOffPolicy (to delay retries), org.springframework.retry.policy.TimeoutRetryPolicy (to limit the total retry time), and the CustomMongoDBRetryPolicy (which retries what seems to be retryable):

<bean id="retryAdvice"
      class="org.springframework.retry.interceptor.RetryOperationsInterceptor">
    <property name="retryOperations">
        <bean class="org.springframework.retry.support.RetryTemplate">
            <property name="retryPolicy">
                <bean class="org.springframework.retry.policy.CompositeRetryPolicy">
                    <property name="optimistic" value="false"/>
                    <property name="policies">
                        <set>
                            <bean class="org.springframework.retry.policy.TimeoutRetryPolicy">
                                <property name="timeout" value="20000"/>
                            </bean>
                            <bean class="CustomMongoDBRetryPolicy">
                                <constructor-arg value="100"/>
                            </bean>
                        </set>
                    </property>
                </bean>
            </property>
            <property name="listeners">
                <set>
                    <bean class="MyRetryListener"/>
                </set>
            </property>
            <property name="backOffPolicy">
                <bean class="org.springframework.retry.backoff.ExponentialBackOffPolicy">
                    <property name="initialInterval" value="500"/>
                    <property name="maxInterval" value="8000"/>
                    <property name="multiplier" value="1.5"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
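
To get a feel for how the back-off values interact with the 20000 ms timeout, the arithmetic can be sketched in plain Java. This is an approximation written for illustration, not Spring Retry's own code: it assumes each delay is the previous one times the multiplier, truncated to whole milliseconds and capped at maxInterval, and it ignores the time the operations themselves take. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical calculator: lists the back-off delays whose running total
// still fits inside a time budget.
final class BackoffScheduleSketch {
    private BackoffScheduleSketch() {}

    static List<Long> delaysWithinBudget(long initialMs, double multiplier,
                                         long maxMs, long budgetMs) {
        if (initialMs <= 0 || maxMs <= 0 || multiplier < 1.0) {
            throw new IllegalArgumentException("intervals must be positive and multiplier >= 1");
        }
        List<Long> delays = new ArrayList<>();
        long interval = Math.min(initialMs, maxMs);
        long elapsed = 0;
        // Keep adding delays while the cumulative wait stays within the budget.
        while (elapsed + interval <= budgetMs) {
            delays.add(interval);
            elapsed += interval;
            // Assumed growth rule: multiply, truncate to milliseconds, cap at maxMs.
            interval = Math.min((long) (interval * multiplier), maxMs);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Values taken from the bean definition above.
        System.out.println(delaysWithinBudget(500, 1.5, 8000, 20000));
    }
}
```

Under these assumptions only a handful of delays fit inside the budget, so the timeout, not the maxAttempts value of 100, is likely to be what ends the retries.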

I've tested this with various scenarios, and it seems to handle most of them pretty well. Whether it will work for a particular application needs to be answered case by case.

  • Initial replica set start - the regular auto-reconnect handles the period before the servers are listening, and this policy handles the period before primary election - all invisible to the application (bar a long lag)
  • Killing the primary - write operation in progress fails to the application, subsequent retry
  • Stepping down the primary, shutting down the primary - same as killing the primary
  • Full replica set restart (if fast enough)
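
A scenario like these can be rehearsed without a replica set by simulating a call that fails a few times and then succeeds. The sketch below is a plain-Java stand-in, not Spring Retry and not the MongoDB driver: SimulatedWrite, retry, and the fixed delay are all hypothetical, and the exception message is borrowed from the policy above purely for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;

final class FailoverRetrySketch {
    private FailoverRetrySketch() {}

    // Hypothetical stand-in for a write that fails until a new primary is elected.
    static final class SimulatedWrite implements Callable<String> {
        private int failuresLeft;
        int calls = 0;

        SimulatedWrite(int failures) {
            this.failuresLeft = failures;
        }

        @Override
        public String call() throws IOException {
            calls++;
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("can't find a master");
            }
            return "written";
        }
    }

    // Retries only on IOException, with a fixed delay between attempts.
    static <T> T retry(Callable<T> op, int maxAttempts, long delayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // treated as transient in this sketch
            } catch (Exception e) {
                throw new IllegalStateException("non-retryable failure", e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new UncheckedIOException("retries exhausted", last);
    }

    public static void main(String[] args) {
        SimulatedWrite write = new SimulatedWrite(3);
        System.out.println(retry(write, 5, 10L) + " after " + write.calls + " calls");
    }
}
```

Running the same simulated write with fewer attempts than failures shows the other outcome, where the failure still reaches the application.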

Hope this helps

Answered by Alan Spencer, Oct 15 '22 01:10