IDDO COHEN

How Does VMware SD-WAN High Availability Work?

A bit more than a year ago, on a morning after a night when my newborn baby couldn’t sleep, I got a call from a customer: “Could you please elaborate and give me the exact specification of your HA protocol?”

Tired and exhausted as I was, I answered: “You want what?”

After the third coffee, and after struggling to answer some of his questions, I decided to go after it and “look under the hood”.

Below are some of the outcomes of my research; hopefully they will help you make design decisions in your network.

What Types of HA Does VMware SD-WAN Have?

Let’s start with the basics. The VMware SD-WAN solution has three types of high availability:

  • VMware SD-WAN Proprietary Standard and Enhanced HA

    Used at spoke locations for redundancy. Although you could technically use it at hub locations too, there you usually want horizontal scale-out instead.

  • VMware SD-WAN Proprietary Clustering

    Used at hub locations for horizontal scale-out. This will be covered in a future blog post, but note that horizontal scale-out does not necessarily equate to redundancy.

  • Industry standard VRRP

    Used at spoke locations together with other CE routers on site. VRRP cannot be used between VMware SD-WAN Edges, as VRRP cannot synchronise all the information we need between the edges.

The purpose of this blog post is to talk about the Standard and Enhanced HA.

Articles around Standard/Enhanced High-Availability can be found here:

How Does Standard HA Work?

Standard High Availability

In this model each edge has the exact same configuration and symmetric connections.

The active and standby edges are connected via GE1/LAN1 to exchange HA heartbeat and control information - like config updates, flow information and tunnel information.

Nothing else is sent on GE1/LAN1.

GE5, GE6 and GE3 are physically connected to the same underlying switches.

The ports on the active edge are open for receiving/sending packets.

The ports on the standby edge, other than GE1/LAN1, neither send nor receive.

The running software ensures that no traffic coming in from or going out to the wire is processed, but at layer 1 the ports remain switched on.
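The standby behaviour can be pictured as a simple gate: the link stays physically up, but the software ignores everything except the HA link. Below is a minimal Python sketch of that idea. The class and method names are made up for illustration and are not taken from the edge software.

```python
from dataclasses import dataclass

HA_LINK = "GE1"  # the HA interconnect (GE1/LAN1 in this post)


@dataclass
class Edge:
    role: str  # "active" or "standby"

    def link_is_up(self, port: str) -> bool:
        # Layer 1 stays switched on for every port on both edges.
        return True

    def may_process(self, port: str) -> bool:
        # The active edge processes traffic on all ports.
        if self.role == "active":
            return True
        # The standby edge only processes traffic on the HA link.
        return port == HA_LINK


standby = Edge(role="standby")
for port in ("GE1", "GE3", "GE5", "GE6"):
    print(port, "link up:", standby.link_is_up(port),
          "processing:", standby.may_process(port))
```

The point of the sketch is only that "link up" and "traffic processed" are two separate decisions on the standby edge.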

What Happens in a Fail-Over Event for Standard HA?

In the event that the active edge fails (in software or hardware), the standby edge will tell the gateway: “Hey, I want to continue using the same tunnel with the same flow configuration, so no one notices there was a failure.”

At switch-over, the standby edge:

  • will use the same logical identifier (logical ID) that the active edge used inside the IPsec tunnel, so that the gateway believes the same edge is talking to it.

  • will send a TLV message within our VCMP protocol inside the IPsec tunnel to signal to the gateway that it is taking over.

From the gateway’s perspective nothing has changed, which allows a quicker fail-over.

Please do not confuse quick fail-over with quick convergence. Data-plane convergence might take time depending on factors such as the number of routes, interfaces, flows, firewall rules, etc.
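To make the takeover announcement more tangible, here is a generic type-length-value (TLV) encoder in Python. VCMP is proprietary, so everything specific in this sketch is an assumption: the type code, the 2-byte type and length fields, and the example logical ID are invented for illustration only.

```python
import struct
from typing import Tuple

# Hypothetical type code; the real VCMP message types are not public.
TLV_HA_TAKEOVER = 0x01


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    # Assumed layout: 2-byte type, 2-byte length, then the value,
    # all in network byte order.
    return struct.pack("!HH", tlv_type, len(value)) + value


def decode_tlv(data: bytes) -> Tuple[int, bytes]:
    tlv_type, length = struct.unpack("!HH", data[:4])
    return tlv_type, data[4:4 + length]


# The standby reuses the logical ID of the failed active edge
# (made-up value), so the gateway sees "the same edge".
logical_id = b"edge-logical-id-1234"
message = encode_tlv(TLV_HA_TAKEOVER, logical_id)

received_type, received_id = decode_tlv(message)
print(received_type == TLV_HA_TAKEOVER, received_id == logical_id)
```

In the real product this message travels inside the existing IPsec tunnel, which is why the gateway does not need to build a new tunnel.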

How Does Standard HA Look and Feel?

From Management-plane Perspective

The edges are a mirror image of each other, and as such the orchestrator presents configuration for “one edge”.

Any changes you make to this edge on the orchestrator, e.g. on the interfaces, will be synchronised from the active edge to the standby edge.

This means the orchestrator does not communicate with the standby edge.

From Control-plane and Data-plane Perspective

Also, the standby edge does not participate in any data-plane communication (e.g. routing, tunnel creation, traffic steering identification, etc.).

To ensure network consistency, we also mirror the MAC addresses of all interfaces of the active edge onto the standby (only for those that are up/live, of course).

This means that when a fail-over occurs, the switching infrastructure will learn the MAC address of the new active edge on a different switch port.

We also send gratuitous ARP from the standby edge at switch-over to ensure quicker Layer 2 network convergence.
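For readers who have not looked at a gratuitous ARP on the wire, the Python sketch below builds one as raw bytes. It follows the standard ARP packet layout; whether the edge sends the request or the reply form is not something I can state here, so the request form is an assumption, and the MAC and IP values are made up.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our (mirrored) MAC as
    # source, EtherType 0x0806 = ARP.
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4,
    # opcode 1 = request.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Gratuitous: sender IP and target IP are the same address.
    arp += mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return ethernet + arp


frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.1")
print(len(frame), frame.hex())
```

Because the MAC addresses are mirrored, the main effect of such a frame is that the switches see the known source MAC on the new port and update their tables.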

You must disable features such as port-security on the switches so that nothing is blocked when the standby edge takes over - I have spent hours in POCs troubleshooting why the standby hadn’t taken over, only to figure out that port-security was involved.

How Does Enhanced HA Work?

Enhanced High-Availability

In this model each edge has the same configuration but different connections.

The key point is that the active edge can use GE6, which is connected on the standby edge, for both overlay and underlay.

The use-case is simple: no switching infrastructure is needed on the WAN side.

Same as in standard HA, the active and standby edges are connected via GE1/LAN1 to exchange HA heartbeat and control information.

However, unlike standard HA, the active edge will now also use GE1/LAN1 to send traffic towards the standby edge in order to use GE6. The standby still stays standby and only forwards the traffic of the active edge.

All control-plane and data-plane functions (e.g. tunnel creation and forwarding) are performed on the active edge.

The ports on the active edge are open for receiving and sending packets.

As in standard HA, all ports on the standby edge, now other than GE1/LAN1 and GE6, neither send nor receive.

What Happens in a Fail-Over Event for Enhanced HA?

It behaves exactly the same as in standard HA mode.

At switch-over, the standby edge will use the logical ID of the active edge and tell the gateway that it is taking over the tunnel, but it can only take over the tunnel that the active edge ran over GE6.

This means that, depending on the DMPO configuration, different applications might be affected when a switch-over occurs (as one of the tunnels no longer exists).
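The consequence for applications can be sketched with a toy model in Python. It assumes one tunnel per WAN port, with GE5 attached to the active edge and GE6 attached to the standby, and it assumes applications are pinned to one tunnel. Real DMPO steering is dynamic and far richer; the edge names, application names and pinning below are hypothetical.

```python
from typing import Dict, List

# Which edge each WAN port is physically attached to (assumed layout).
wan_ports: Dict[str, str] = {"GE5": "edge-a", "GE6": "edge-b"}

# Hypothetical steering policy: application -> preferred WAN port.
app_policy: Dict[str, str] = {"voice": "GE5", "backup": "GE6"}


def surviving_ports(ports: Dict[str, str], failed_edge: str) -> List[str]:
    # Only WAN ports attached to the surviving edge remain usable.
    return [port for port, edge in ports.items() if edge != failed_edge]


def affected_apps(policy: Dict[str, str], alive: List[str]) -> List[str]:
    # Applications whose preferred tunnel disappeared with the failed edge.
    return [app for app, port in policy.items() if port not in alive]


alive = surviving_ports(wan_ports, failed_edge="edge-a")  # edge-a was active
print("surviving tunnels:", alive)
print("affected applications:", affected_apps(app_policy, alive))
```

In this toy setup only the tunnel over GE6 survives, so anything that relied on GE5 has to be re-steered or is interrupted.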

How Does Enhanced HA Look and Feel?

From Management-plane Perspective

Here too the behaviour is the same as in standard HA, with one small difference: both GE5 and GE6 now show up for configuration in the orchestrator under the same edge.

The active edge knows, based on the synchronisation with the standby, that one interface does not belong to it.

No extra configuration needs to be done on the orchestrator to “activate enhanced HA” - it happens automatically.

From Control-plane and Data-plane Perspective

The standby is still passive and only forwards the packets from the active edge received on GE1/LAN1 towards GE6 (and vice versa).

The standby edge behaves exactly the same with regard to not receiving or sending packets on all unused ports.

But now to the key question: if the standby edge does not participate in the control and data plane, how does it know to forward the packets of the active edge towards GE6?

This is done by encapsulating the information about which VLAN and interface the edge should use within the Ethernet header.

The active edge does this when a packet is forwarded towards the standby, and the standby edge does it for the return traffic coming in from GE6.

This way the standby edge does not need to do any additional computation, as it strips off the Ethernet header anyhow for HA-based information.
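The idea of carrying the forwarding hint with the frame can be sketched in Python as follows. The real on-wire format is not public, so the layout here is purely an assumption: an outer Ethernet header with a placeholder EtherType (0x88B5, one of the IEEE local experimental values), followed by a 1-byte interface index and a 2-byte VLAN ID. All function names are hypothetical.

```python
import struct
from typing import Tuple

HA_ETHERTYPE = 0x88B5            # placeholder EtherType, not the real one
HA_META = struct.Struct("!BH")   # assumed: interface index, VLAN ID


def wrap_for_ha_link(dst_mac: bytes, src_mac: bytes,
                     if_index: int, vlan: int, packet: bytes) -> bytes:
    # The sender states up front which interface and VLAN to use.
    header = dst_mac + src_mac + struct.pack("!H", HA_ETHERTYPE)
    return header + HA_META.pack(if_index, vlan) + packet


def unwrap_from_ha_link(frame: bytes) -> Tuple[int, int, bytes]:
    # The receiver only strips the header and reads the hint;
    # no route or flow lookup is needed.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != HA_ETHERTYPE:
        raise ValueError("not an HA-encapsulated frame")
    if_index, vlan = HA_META.unpack(frame[14:17])
    return if_index, vlan, frame[17:]


frame = wrap_for_ha_link(b"\x02" * 6, b"\x04" * 6, 6, 100, b"payload")
print(unwrap_from_ha_link(frame))
```

The same wrap and unwrap steps apply in both directions: the active edge wraps towards the standby, and the standby wraps return traffic from GE6 towards the active edge.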

So Do Two Software Versions Exist?

No, there is no difference in software between standard and enhanced HA - all algorithms are the same.

The only key difference is that, as soon as the active edge wants to send the packet out onto the wire, the network driver on the active edge realises: “Oh, this interface does not belong to me, let me send it to the standby” - that is it.
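That driver decision boils down to a single ownership check, sketched here in Python. The function name and the port sets are illustrative assumptions, not the actual driver logic.

```python
from typing import Set

HA_LINK = "GE1"  # the HA interconnect (GE1/LAN1 in this post)


def choose_egress(target_port: str, local_ports: Set[str]) -> str:
    # If the target interface is local, send straight onto the wire.
    if target_port in local_ports:
        return target_port
    # Otherwise hand the packet to the standby over the HA link.
    return HA_LINK


active_ports = {"GE3", "GE5"}  # assumed: GE6 lives on the standby
print(choose_egress("GE5", active_ports))  # GE5
print(choose_egress("GE6", active_ports))  # GE1
```

In standard HA every interface is local to the active edge, so the second branch is simply never taken; that is why no separate software mode is needed.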

Summary

The VMware SD-WAN edges support standard and enhanced HA.

In standard HA the edges have symmetrical connections, whereas in enhanced HA the connections are asymmetrical.

Standard and enhanced HA use the same mechanisms; there are neither two software modes nor two configuration modes on the orchestrator.

Standard and enhanced HA fail over the same way.

Let me know what you think in the comment section.