SQL Server Always On NAT Support: Architecture & Multi-Subnet Failures

Disaster recovery plans often collide with network security policies. When extending SQL Server Always On Availability Groups across data centers, infrastructure teams frequently attempt to bridge the gap using Network Address Translation (NAT).

While a standard ping may succeed across these boundaries, Windows Server Failover Clustering (WSFC) communication will not. The incompatibility stems from a mismatch between the rewritten packet headers and the encrypted data payload, which still carries the node's original address, so ordinary firewall translation undermines cluster stability.

Below is the technical explanation of why NAT breaks quorum, the error signatures to look for in the cluster log, and the routing architectures that Microsoft supports.

NAT & Multi-Subnet SQL Server Always On | GigXP.com
Updated Dec 2025

NAT vs. Always On: Why Your Multi-Subnet Cluster Will Fail

Building a Microsoft SQL Server Always On Availability Group (AG) across two different sites is a standard requirement for disaster recovery. You have Subnet A (10.1.2.x) and Subnet B (10.1.5.x). The connection exists. The ping works. But if you introduce Network Address Translation (NAT) between these cluster nodes, the architecture both violates Microsoft's support requirements and breaks the mechanics the cluster depends on.

Windows Server Failover Clustering (WSFC) demands a routed Layer 3 network. It expects the IP address in the packet header to match the IP address inside the data payload. NAT breaks this.

The Bottom Line

Microsoft Support guidelines explicitly state that NAT is not supported between cluster nodes or between nodes and Domain Controllers. You must route traffic transparently.

The Technical Failure Points

Support statements exist for a reason. NAT introduces specific mechanical failures in the WSFC protocols that “static mapping” cannot resolve.

1. The UDP Heartbeat Mismatch

Cluster nodes send heartbeats on UDP port 3343. These packets contain a payload with the sender’s configuration and identity. In a routed environment, the Source IP in the header matches the payload. When a NAT device alters the header IP but leaves the encrypted payload alone, the receiving node detects a mismatch. It interprets the packet as spoofed or malformed and discards it.
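The mismatch can be illustrated with a toy model. This is only a sketch: the real WSFC heartbeat format is not reproduced here, and the Heartbeat class, its two fields, and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical, simplified stand-in for a heartbeat packet."""
    header_src_ip: str   # source IP in the IP header (a NAT device can rewrite this)
    payload_src_ip: str  # sender identity carried inside the payload


def apply_source_nat(packet: Heartbeat, translated_ip: str) -> Heartbeat:
    """Rewrite only the header, as a NAT device would; the payload is untouched."""
    return replace(packet, header_src_ip=translated_ip)


def receiver_accepts(packet: Heartbeat) -> bool:
    """Toy version of the consistency check described above."""
    return packet.header_src_ip == packet.payload_src_ip


routed = Heartbeat(header_src_ip="10.1.2.10", payload_src_ip="10.1.2.10")
natted = apply_source_nat(routed, translated_ip="203.0.113.10")

print(receiver_accepts(routed))  # True: header and payload agree
print(receiver_accepts(natted))  # False: header was translated, payload was not
```

The point of the sketch is that a static one-to-one mapping does not help: any translation of the header, however stable, leaves the payload pointing at the original address.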

2. Kerberos Authentication & SPNs

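Active Directory (AD) authentication relies on Kerberos. Service Principal Names (SPNs) are registered against host names and service accounts, not against IP addresses, so a ticket is only usable when the name a node resolves and connects to maps to the account that actually holds the SPN. When NAT makes a node or witness reachable only through a translated address, that mapping can break: the ticket presented to a File Share Witness or another node may be encrypted for a different account than the one answering, and authentication fails with KRB_AP_ERR_MODIFIED.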

The Witness Blind Spot

Your File Share Witness (FSW) is the tie-breaker for quorum. If Node A and Node B cannot communicate due to NAT issues, they both race to lock the FSW file to claim majority.

However, NAT devices track the state of the SMB session (TCP port 445) that the witness lock depends on. If the firewall drops the idle TCP session because keep-alives were lost, the cluster service treats the witness as offline.
Result: If the primary node crashes, the secondary node cannot count the witness vote, fails to reach quorum, and the availability group databases go offline.
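The vote arithmetic behind that result can be sketched as follows. The sketch assumes a simple two-node cluster with one File Share Witness and a strict-majority rule; it deliberately ignores dynamic quorum and dynamic witness adjustments, and has_quorum is a hypothetical helper name.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Strict majority rule: more than half of all configured votes must be present."""
    return votes_present > total_votes // 2


TOTAL_VOTES = 3  # Node A + Node B + File Share Witness (assumed layout)

# Primary (Node A) has crashed, but the secondary still reaches the witness.
print(has_quorum(votes_present=2, total_votes=TOTAL_VOTES))  # True

# Primary has crashed and the witness session was dropped by the firewall.
print(has_quorum(votes_present=1, total_votes=TOTAL_VOTES))  # False
```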

The ARP Cache Problem

Even if you fix authentication, failover relies on Gratuitous ARP (GARP). When the Primary IP fails over from Site A to Site B, the new server broadcasts a GARP to update the network tables.

Firewalls performing NAT often ignore GARP or hold aggressive ARP caches for the translation. This results in a “black hole” scenario where the database is online at Site B, but the network keeps sending traffic to the dead Site A for the duration of the firewall’s ARP timeout (often 4 hours by default).
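As a rough illustration of the black hole window, the sketch below computes how long traffic keeps going to the old site. It assumes the only thing that clears the stale entry is the ARP timeout; blackhole_seconds and its parameters are hypothetical, and the 4-hour figure is the default mentioned above.

```python
def blackhole_seconds(arp_timeout_s: int, entry_age_s: int, garp_honoured: bool) -> int:
    """Seconds during which traffic is still sent to the failed site."""
    if garp_honoured:
        # The device updated its table when it saw the GARP, so there is no window.
        return 0
    # Otherwise the stale entry lives until the remaining timeout expires.
    return max(0, arp_timeout_s - entry_age_s)


FOUR_HOURS = 4 * 60 * 60  # 14400 seconds

# Entry was refreshed 10 minutes before failover and the GARP is ignored.
print(blackhole_seconds(FOUR_HOURS, entry_age_s=600, garp_honoured=False))  # 13800
# Same situation on a device that honours the GARP.
print(blackhole_seconds(FOUR_HOURS, entry_age_s=600, garp_honoured=True))   # 0
```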

Visualizing the Failure

[Interactive diagram not reproduced here: it shows packet inspection failing when NAT is introduced between Node A and Node B.]

The “Catch-22” of DNS Registration

Beyond packet loss, NAT breaks the logic of the Always On Listener. The listener is a DNS name that clients use to connect.

Scenario A: Register Real IP

Action: The node registers its local NIC IP (10.1.2.10) in DNS.

Result: External clients resolve 10.1.2.10. Because they are behind a NAT, 10.1.2.10 is unreachable (private IP). Connection Fails.

Scenario B: Register NAT IP

Action: You manually register the NAT VIP (203.0.113.10) in DNS.

Result: The Cluster Service attempts to bring the resource online. It checks its local NICs, sees it does not own 203.0.113.10, and declares the resource failed. Cluster Fails.
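Both scenarios can be checked with a small model. This is a sketch under stated assumptions: "reachable by an external client" is approximated as "not an RFC 1918 private address", and "the cluster can bring the resource online" is approximated as "the IP is bound to a local NIC". The function names are hypothetical.

```python
import ipaddress

# RFC 1918 private ranges, listed explicitly for the reachability approximation.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def reachable_by_external_client(ip: str) -> bool:
    """Assumption: clients behind the NAT cannot route to RFC 1918 addresses."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in RFC1918)


def cluster_can_bring_online(ip: str, local_nic_ips: set) -> bool:
    """Assumption: the cluster only onlines an IP that a local NIC owns."""
    return ip in local_nic_ips


def evaluate(registered_ip: str, local_nic_ips: set) -> tuple:
    """Return (client_ok, cluster_ok) for the IP registered in DNS."""
    return (
        reachable_by_external_client(registered_ip),
        cluster_can_bring_online(registered_ip, local_nic_ips),
    )


nics = {"10.1.2.10"}
print(evaluate("10.1.2.10", nics))     # Scenario A: (False, True)
print(evaluate("203.0.113.10", nics))  # Scenario B: (True, False)
```

Neither registration satisfies both conditions at once, which is the catch-22.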

Forensics: Reading the Cluster Log

If your networking team insists “the firewall is open,” you can prove the NAT/Routing issue by generating the Cluster Log. Open PowerShell as Admin and run Get-ClusterLog -Destination . -TimeSpan 15. Search the log for these specific indicators:

00001a2s.00003b4c:: [FTI][Initiator] Aborting connection because NetFT route to node 2 is faulty
00001a2s.00003b4c:: [IM] Sending handshake to remote 10.1.5.20:3343 failed with error 10060 (Connection Timed Out)
00002b3x.00004c5d:: [SEC] Security Context failed to retrieve target name for 10.1.5.20
00002b3x.00004c5d:: [PULL] Requesting resync because sequence numbers do not match (Packet Modification Detected)
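Searching a large log by hand is tedious, so a short script can flag the lines of interest. The sketch below matches only the substrings taken from the four indicator lines above; find_indicators is a hypothetical helper, and the first sample line is made up to show a non-match.

```python
# Substrings taken from the indicator lines listed above.
INDICATORS = (
    "NetFT route to node",
    "failed with error 10060",
    "Security Context failed to retrieve target name",
    "sequence numbers do not match",
)


def find_indicators(lines):
    """Return (line_number, matched_indicator) pairs for every matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for marker in INDICATORS:
            if marker in line:
                hits.append((number, marker))
    return hits


sample = [
    "hypothetical unrelated log line",
    "00001a2s.00003b4c:: [FTI][Initiator] Aborting connection because NetFT route to node 2 is faulty",
    "00001a2s.00003b4c:: [IM] Sending handshake to remote 10.1.5.20:3343 failed with error 10060 (Connection Timed Out)",
    "00002b3x.00004c5d:: [SEC] Security Context failed to retrieve target name for 10.1.5.20",
    "00002b3x.00004c5d:: [PULL] Requesting resync because sequence numbers do not match (Packet Modification Detected)",
]

for number, marker in find_indicators(sample):
    print(number, marker)
```

Because the function accepts any iterable of strings, an opened log file can be passed in directly; the file name and text encoding of the generated log should be checked on your system rather than assumed.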

Supported Target Network Matrix

To build a supported configuration, you must implement specific firewall rules without NAT.

Protocol | Port        | Source | Destination        | Purpose
UDP      | 3343        | Nodes  | Nodes              | Cluster Heartbeat (Must Allow Fragments)
TCP      | 3343        | Nodes  | Nodes              | Cluster State Sync
TCP      | 445         | Nodes  | Witness / Nodes    | File Share Witness Quorum
ICMP     | N/A         | Nodes  | Nodes              | Echo Request/Reply (Required for Validation)
TCP      | 5022        | SQL IP | SQL IP             | Always On Data Replication
TCP      | 1433        | SQL IP | SQL IP             | TDS / Backup Coordination
TCP/UDP  | 88          | Nodes  | Domain Controllers | Kerberos Auth
TCP/UDP  | 389         | Nodes  | Domain Controllers | LDAP Queries
TCP/UDP  | 464         | Nodes  | Domain Controllers | Password Rotation (Required)
TCP      | 49152-65535 | Nodes  | Domain Controllers | RPC Dynamic Ports
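A quick way to sanity-check the fixed TCP ports in the matrix is a plain connect test. The sketch below is a minimal example using only the Python standard library; tcp_port_open and check_host are hypothetical helpers. It does not cover UDP 3343, fragment handling, ICMP, or the dynamic RPC range, and a successful connect says nothing about whether NAT sits in the path.

```python
import socket

# Fixed TCP ports from the matrix above, with their stated purpose.
TCP_PORTS = {
    3343: "Cluster State Sync",
    445: "File Share Witness Quorum",
    5022: "Always On Data Replication",
    1433: "TDS / Backup Coordination",
    88: "Kerberos Auth",
    389: "LDAP Queries",
    464: "Password Rotation",
}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_host(host: str, ports=TCP_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to the result of a connect test against the given host."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling check_host with a peer address, for example the 10.1.5.20 node that appears in the log excerpt, returns a dictionary of port numbers to True or False. Which ports apply to which destination (node, witness, or Domain Controller) still has to be read from the matrix.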

Valid Alternatives (The “Right” Way)

If your networking team cannot provide a routed (non-NAT) path between sites, you cannot use a single Stretched WSFC. You must change the SQL architecture.

1. Distributed Availability Groups (DAG)

DAGs are the designed solution for this problem. A DAG allows you to have Cluster A at Site 1 and Cluster B at Site 2.

  • They do not share a quorum.
  • They do not share a cluster IP.
  • They do not require L2 adjacency.

The DAG simply forwards log blocks from Cluster A to Cluster B. Because they are independent clusters, the NAT issue between nodes disappears (though you still need to be careful with the forwarding ports).

2. Log Shipping

The “old reliable” method. It is file-based rather than session-based. If you can copy a file over SMB (port 445) or FTP from Site A to Site B, you can have DR. Because it does not depend on a shared cluster, most of the WSFC networking restrictions do not apply to it.

Performance Penalties: The Hidden Cost

Even if you hack together a solution that bypasses the validation wizards, NAT devices add per-packet processing latency that can cripple Synchronous Commit availability groups.

In Synchronous Commit mode, the Primary replica cannot confirm a transaction to the user until it receives an acknowledgement (ACK) from the Secondary. Firewalls performing Deep Packet Inspection (DPI) or NAT translation add milliseconds to every round trip.

The Math:
An extra 5 ms of one-way latency from firewall inspection adds 10 ms to every round trip, and therefore to every transaction.
If your app runs 100 transactions per second sequentially, those 100 transactions now carry 1 full second of added wait.
Symptom: High HADR_SYNC_COMMIT wait times in SQL DMVs.
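The arithmetic can be written out as a short sketch. The function names are hypothetical, and the 10 ms baseline per transaction is an assumption derived from the stated rate of 100 sequential transactions per second.

```python
def added_delay_seconds(one_way_penalty_ms: float, sequential_tx: int) -> float:
    """Total extra wait when each transaction pays the penalty in both directions."""
    round_trip_penalty_ms = 2 * one_way_penalty_ms
    return sequential_tx * round_trip_penalty_ms / 1000.0


def sequential_throughput(per_tx_ms: float) -> float:
    """Transactions per second when each one must finish before the next starts."""
    return 1000.0 / per_tx_ms


print(added_delay_seconds(one_way_penalty_ms=5, sequential_tx=100))  # 1.0

# Assumed baseline: 100 tx/sec sequentially means 10 ms per transaction.
print(sequential_throughput(10))       # 100.0
# With the 10 ms round-trip penalty added, each transaction takes 20 ms.
print(sequential_throughput(10 + 10))  # 50.0
```

Under these assumptions, the same sequential workload drops to half its original throughput.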

The Network Admin Script

Copy this text and send it to your networking or firewall team to explain exactly what is needed to support the Microsoft architecture.

SUBJECT: Firewall Requirements for SQL Always On Cluster

To the Network Team,

We are deploying a Microsoft Windows Server Failover Cluster (WSFC) for the SQL Availability Group. This protocol has strict requirements regarding packet inspection and modification.

For the path between [Site A Subnet] and [Site B Subnet], we require:

1. NO NAT (Network Address Translation).
   – The Source IP seen by the destination server MUST match the actual configured IP of the source server.
   – The WSFC heartbeat payload contains the IP address; if the packet header IP does not match the payload IP, the cluster service will drop the packet as malformed.

2. ALLOW FRAGMENTATION.
   – UDP 3343 packets (Heartbeat) are often fragmented. Please ensure the firewall does not drop UDP fragments between these hosts.

3. PRESERVE RPC.
   – We require the full RPC range (49152-65535) open to Domain Controllers for authentication tokens.

If a routed, non-NAT path is not possible, we cannot proceed with a Stretched Cluster design and must pivot to a Distributed Availability Group (DAG), which will require additional hardware/licensing scope.

Please confirm if a non-NAT route can be provisioned.

Frequently Asked Questions

Can I use NAT if I only use it for client connections?
Yes. NAT at the edge for client ingress is generally acceptable. The strict “No-NAT” rule applies specifically to the traffic between cluster nodes and traffic between nodes and the Domain Controllers.
What happens if I block ICMP?
While the cluster might technically run, the “Validate a Configuration” wizard will fail. This leaves your deployment in an unsupported state. Microsoft Support may refuse assistance until the validation passes.
Why do I need Port 464?
The Cluster Name Object (CNO) is a computer account in AD. It rotates its password every 30 days using the kpasswd protocol on port 464. If you block this, the cluster will eventually fail to authenticate, often weeks after deployment.
Does a multi-subnet listener require special client strings?
Yes. You must use MultiSubnetFailover=True in your connection string. This allows the client to attempt connections to all listener IPs in parallel. Without it, clients may wait 20+ seconds for a TCP timeout when the active node is in the secondary subnet.
Is SQL Basic Availability Groups supported over NAT?
No. Basic AGs still rely on the underlying Windows Server Failover Cluster (WSFC). If the WSFC fails due to NAT, the Basic AG fails with it.
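To put a number on the MultiSubnetFailover answer above, the sketch below models the wait a client spends on offline listener IPs. It is a simplification: the listener IPs are hypothetical, the 21-second timeout is an assumed value consistent with the "20+ seconds" figure, actual connect time to the online IP is ignored, and worst_case_connect_wait is an illustrative helper rather than driver behaviour.

```python
def worst_case_connect_wait(listener_ips, online_ip, timeout_s, parallel):
    """Seconds spent waiting on offline IPs before the online one answers."""
    if parallel:
        # All IPs are tried at once, so offline IPs do not delay the online one.
        return 0.0
    # Sequential attempts: every offline IP tried first costs a full timeout.
    offline_tried_first = listener_ips.index(online_ip)
    return offline_tried_first * timeout_s


listener_ips = ["10.1.2.50", "10.1.5.50"]  # hypothetical listener IPs, one per subnet
online_ip = "10.1.5.50"                    # active node is in the secondary subnet

print(worst_case_connect_wait(listener_ips, online_ip, 21.0, parallel=False))  # 21.0
print(worst_case_connect_wait(listener_ips, online_ip, 21.0, parallel=True))   # 0.0
```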
