Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Azure SQL failover group, what does the grace period mean?

I am currently reading this: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group, and I have a hard time understanding the automatic failover policy:

By default, a failover group is configured with an automatic failover policy. The SQL Database service triggers failover after the failure is detected and the grace period has expired. The system must verify that the outage cannot be mitigated by the built-in high availability infrastructure of the SQL Database service due to the scale of the impact. If you want to control the failover workflow from the application, you can turn off automatic failover.

When defining the failover group in an ARM template:

{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerPrimaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[parameters('location')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerPrimaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerPrimaryAdminPassword')]",
    "version": "12.0"
  },
  "resources": [
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "apiVersion": "2015-05-01-preview",
      "type": "failoverGroups",
      "name": "[variables('sqlFailoverGroupName')]",
      "properties": {
        "serverName": "[variables('sqlServerPrimaryName')]",
        "partnerServers": [
          {
            "id": "[resourceId('Microsoft.Sql/servers/', variables('sqlServerSecondaryName'))]"
          }
        ],
        "readWriteEndpoint": {
          "failoverPolicy": "Automatic",
          "failoverWithDataLossGracePeriodMinutes": 60
        },
        "readOnlyEndpoint": {
          "failoverPolicy": "Disabled"
        },
        "databases": [
          "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]"
        ]
      },
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]",
        "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]",
        "[resourceId('Microsoft.Sql/servers', variables('sqlServerSecondaryName'))]"
      ]
    },
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "name": "[variables('sqlDatabaseName')]",
      "type": "databases",
      "apiVersion": "2014-04-01-preview",
      "location": "[parameters('location')]",
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]"
      ],
      "properties": {
        "edition": "[variables('sqlDatabaseEdition')]",
        "requestedServiceObjectiveName": "[variables('sqlDatabaseServiceObjective')]"
      }
    }
  ]
},
{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerSecondaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[variables('sqlServerSecondaryRegion')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerSecondaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerSecondaryAdminPassword')]",
    "version": "12.0"
  }
}

I specify the readWriteEndpoint like this:

    "readWriteEndpoint": {
      "failoverPolicy": "Automatic",
      "failoverWithDataLossGracePeriodMinutes": 60
    }

With a failoverWithDataLossGracePeriodMinutes set to 60 minutes.

What does this mean? I cannot find a clear answer anywhere. Does it mean that:

  1. When an outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. In the 60 minutes, the only way to read my data is to use the readOnlyEndpoint directly? OR
  2. My read/write endpoint is turned instantly, if they somehow can detect that there was no data to be synced

I think it boils down to: do I have to manually make the failover, if I detect an outage, if I don't care about data loss, but I want to be able to write to my database?

Bonus question: is the reason why the grace period is present because there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary (if i switch manually)?

Sorry, I can't keep it to only one question. I have read a lot and I really need to know this.

like image 218
mslot Avatar asked Jun 15 '19 17:06

mslot


People also ask

What is failover group in Azure SQL?

A failover group is a named group of databases managed by a single server that can fail over as a unit to another Azure region in case all or some primary databases become unavailable due to an outage in the primary region.

How do you always failover on a cluster?

Use SQL Server Management Studio Expand the Always On High Availability node and the Availability Groups node. Right-click the availability group to be failed over, and select Failover.

What events happen when a failover occurs?

During the failover, the failover target takes over the primary role, recovers its databases, and brings them online as the new primary databases. The former primary replica, when available, switches to the secondary role, and its databases become secondary databases.

How long does Azure SQL failover take?

This process normally should complete within 30 seconds or less. There's a short period during which both databases are unavailable, on the order of 0 to 25 seconds, while the roles are switched.


Video Answer


1 Answers

What does this mean?

It means that:

"when a outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. "

It can't failover automatically even when the data is synced because the high-availability solution in the primary region is trying to do the same thing, and almost all of the time your primary database will come back quickly in the primary region. Performing an automatic cross-region fail-over would interfere with this.

And

"the reason why the grace period is present, is that because the there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary"

And to allow time for the database to failover within the primary region.

like image 84
David Browne - Microsoft Avatar answered Oct 22 '22 23:10

David Browne - Microsoft